Probabilistic reasoning typically suffers from the explosive amount of information
it  must  maintain.  There  are  a  variety  of  methods  available  for  curbing  this  explosion. 
However,  in  doing  so,  it  is  important  to  avoid  oversimplifying  the  given  domain  through 
injudicious  use  of  assumptions  such  as  independence.  Multiple  splining  is  an  approach 
for  compressing  and  approximating  the  probabilistic  information.  Instead  of  positing 
additional  independence  conditions,  it  attempts  to  identify  patterns  in  the  information. 
While the data explosion is multiplicative in nature, $O(n_1 n_2 \cdots n_k)$, multiple splining
reduces it to an additive one, $O(n_1 + n_2 + \cdots + n_k)$. We consider how these splines
can  be  found  and  used.  Since  splines  exploit  patterns  in  the  data,  we  can  also  use  them 
to  help  in  filling  in  missing  data.  As  it  turns  out,  our  splining  method  is  quite  general 
and  may  be  applied  to  other  domains  besides  probabilistic  reasoning  which  can  benefit 
from  data  compression. 


1.  Introduction 

Managing  large  amounts  of  probabilistic  information  has  been  of  particular 
interest  to  researchers  modeling  uncertainty.  We  can  capture  a  lot  of  detail  for  any 
given  domain  by  representing  knowledge  in  terms  of  correlations.  Combined  with  the 
strong  semantics  of  probability  theory,  we  can  manipulate  it  in  a  sound  and  rigorous 
fashion.  Unfortunately,  as  we  often  find,  nearly  everything  cross-correlates  in  the 
strictest  sense  causing  a  tremendous  data  explosion. 

We look in particular at a class of probabilistic models called Bayesian networks
[6]. Other models such as influence diagrams [13], Markov random fields [3], similarity
networks [4], and Markov networks [6] can also be considered since they are
very closely related. However, for our purposes, Bayesian networks will be sufficient
since our discussion will be readily applicable to the other models, as we shall see.


*  This  paper  was  supported  in  part  by  AFOSR  Project  #940006  and  by  Rome  Labs  Project  Number 
55812769  USAF.  Thanks  to  the  anonymous  reviewers  who  helped  improve  this  paper. 
In Bayesian networks, objects and/or events are represented by random variables.¹
Correlations between these objects are modeled as conditional probabilities.
In order to avoid the correlations explosion, simplifying assumptions concerning
independence conditions are made about the target domain. In particular, conditional
independencies are asserted between collections of r.v.s. Assume that each r.v. is
designated by some unique node in a directed graph. Given a r.v. $A$, let $W_1$ be the
immediate parents of $A$. For any set of r.v.s $W_2$ such that $A$ is not an ancestor of
any of those in $W_2$, $P(A \mid W_1, W_2) = P(A \mid W_1)$. Furthermore, such a graph must
be acyclic. These networks are also called causal networks since we can interpret the
conditionals as: If $W_1$, then $A$ with probability $P(A \mid W_1)$.

To complete the formulation of Bayesian networks, each node has an associated
table of conditional probabilities. For example, node $A$ will have the table consisting of
$P(A \mid W_1)$ for all possible assignments to $A$ and $W_1$. The power of Bayesian networks
lies in the fact that because of the above conditional independence structure, it is
relatively straightforward to compute the joint probabilities of any complete assignment
to all the r.v.s. We simply look up the corresponding probabilities in each of the
conditional tables and multiply them together.² For example, consider the network in
Fig. 1. The joint probability for the assignment $\{A = F, B = T, C = F, D = T, E = F\}$
is


$$
\begin{aligned}
P(A = F, B = T, C = F, D = T, E = F)
  &= P(E = F \mid C = F)\, P(D = T \mid A = F, C = F) \\
  &\quad \times P(C = F \mid B = T)\, P(B = T)\, P(A = F),
\end{aligned}
$$

which  computes  to  0.21168. 

The  total  number  of  conditional  probabilities  we  must  maintain  in  a  Bayesian 
network  is  simply  the  sum  of  the  number  of  entries  in  each  table.  The  size  of  the  table 
is  governed  by  the  number  of  r.v.s  involved,  the  node  and  all  its  parents,  and  by  the 
number  of  possible  assignments  to  each  r.v.  Clearly,  the  table  size  is  a  multiplicative 
factor  but  nonetheless  is  an  improvement  over  the  completely  cross-correlated  case 
without  independence  assumptions. 
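The counting argument can be sketched in a few lines of Python. The parent structure below is read off the factorization used for Fig. 1; the assumption that every r.v. is binary comes from the T/F assignment in that example, and the names `parents`, `range_size`, and `table_size` are illustrative only.

```python
from math import prod

# Parent sets implied by the factorization
# P(E | C) P(D | A, C) P(C | B) P(B) P(A).
parents = {"A": [], "B": [], "C": ["B"], "D": ["A", "C"], "E": ["C"]}

# Assumption: every r.v. takes two values (T or F).
range_size = {name: 2 for name in parents}

def table_size(node):
    """Entries in the conditional table of `node`: one per joint
    assignment to the node and all of its immediate parents."""
    return range_size[node] * prod(range_size[p] for p in parents[node])

# Sum over the tables (the network) versus the full joint distribution.
network_entries = sum(table_size(v) for v in parents)
full_joint_entries = prod(range_size.values())
print(network_entries, full_joint_entries)
```

Each table grows as a product of range sizes, while the network as a whole only adds the tables together; with larger ranges or more parents per node the product inside `table_size` is what dominates.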

Organizing r.v.s in an effort to curb the probabilities explosion has been applied
to domains such as vision and pattern recognition. For example, hierarchically grouping
sensors into vertical, horizontal, etc. line detectors reduces the overall connectivity
of the graph. However, we must keep in mind that these tend to be overly simplistic
assumptions resulting in unrealistic solutions. Furthermore, connectivity is only one
of the factors in table size. Independence does not affect the number of possible
assignments for a given r.v.


1  We  abbreviate  “random  variables”  to  “r.v.”  throughout  the  rest  of  this  paper.  We  also  assume  that  the 
r.v.s  are  discrete. 

2 For more detailed information, see Bayes’ theorem in [6].
Fig. 1. Simple Bayesian network.


Consider  the  case  where  we  have  designated  six  sonars  located  around  some 
mobile  robot  for  use  in  determining  its  present  location.  Treat  each  sonar  as  a  r.v. 
with  assignments  based  on  the  possible  different  sonar  readings.  If  we  discretize  our 
readings too much, we lose a great deal of accuracy. On the other hand, with only 10
values each, we still have a minimum of $10^6$ possible combinations.

Aside from space considerations, how does this all impact our reasoning algorithms?
Reasoning with Bayesian networks has been shown to be NP-hard [1, 2,
15] except for special classes such as polytrees, which are polynomial with respect to
the number of nodes. Unfortunately, even for polytrees, each node visit entails
a complete search through the associated table [6, 18]. Hence, large tables clearly
compound our computational problems.

Recently,  alternative  approaches  to  the  table  size  problem  have  been  developed 
[8,  12,  14].  Instead  of  relying  on  independence  assumptions,  a  search  is  made  for 
possible  patterns  in  the  conditional  tables.  Once  some  regular  pattern  is  identified, 
we  hope  to  encode  it  into  some  compact  representation  allowing  us  to  dispense  with 
explicitly  storing  the  table. 

One method is founded on something called independence-based assignments
(abbrev. IBMAPs) [12, 14]. In this approach, we attempt to find a specific set of entries
in the table which share the same probability. Let $A$ be some r.v. and $B_1, \ldots, B_n, C$ be
the immediate parents of $A$. If for some set of assignment values for $A, B_1, \ldots, B_n$,
say $a, b_1, \ldots, b_n$,

$$P(A = a \mid B_1 = b_1, \ldots, B_n = b_n, C = c)$$

are identical for all possible assignments $c$ to $C$, then we can conclude that under the
assignments $a, b_1, \ldots, b_n$, any assignment to $C$ will have no effect on the conditional
probability. Hence, we can throw away all those redundant probabilities and replace
them by the single probability

$$P(A = a \mid B_1 = b_1, \ldots, B_n = b_n).$$

Note that $C$ can be generalized to a set of r.v.s as well.
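As a rough illustration of this compaction, the following Python sketch merges the entries that agree for every assignment to $C$ and keeps the rest. It is not the procedure of [12, 14]; the function names and the table values are hypothetical, and a single parent $B$ stands in for $B_1, \ldots, B_n$.

```python
from collections import defaultdict

def collapse_irrelevant_parent(table):
    """table maps (a, b, c) -> P(A=a | B=b, C=c).
    Returns (compact, kept): compact maps (a, b) -> p wherever p is the
    same for every c; kept holds the entries that could not be merged."""
    groups = defaultdict(dict)
    for (a, b, c), p in table.items():
        groups[(a, b)][c] = p
    compact, kept = {}, {}
    for (a, b), by_c in groups.items():
        values = set(by_c.values())
        if len(values) == 1:          # identical for all assignments to C
            compact[(a, b)] = values.pop()
        else:
            for c, p in by_c.items():
                kept[(a, b, c)] = p
    return compact, kept

def lookup(compact, kept, a, b, c):
    """Retrieve P(A=a | B=b, C=c) from the compacted representation."""
    return compact[(a, b)] if (a, b) in compact else kept[(a, b, c)]

# Hypothetical table: C is irrelevant when B = T, relevant when B = F.
table = {
    ("T", "T", "T"): 0.9, ("T", "T", "F"): 0.9,
    ("F", "T", "T"): 0.1, ("F", "T", "F"): 0.1,
    ("T", "F", "T"): 0.3, ("T", "F", "F"): 0.6,
    ("F", "F", "T"): 0.7, ("F", "F", "F"): 0.4,
}
compact, kept = collapse_irrelevant_parent(table)
```

The sketch tests for exact equality, which mirrors the requirement that the probabilities be identical; no precision is lost, but only exactly repeated values are merged.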
Another method called “Noisy-OR” models [6, 7, 17] also exploits domain-dependent
information by attempting to identify disjunctive interactions (and their
generalizations) between r.v.s. Once identified, we can reduce the amount of information
necessary for computing the probabilities.

A  third  method  is  based  on  the  idea  of  finding  a  real  function  called  a  linear 
potential  function  (abbrev.  LPF)  which  can  ideally  interpolate  through  all  the  points 
in  the  table  [8].  The  arguments  to  the  function  would  be  the  assignments  to  the 
r.v.s.  Since  assignments  to  r.v.s  are  often  abstract  objects  or  concepts,  we  must  map 
them  to  real  values.  At  the  same  time,  the  choice  of  mappings  will  directly  affect 
the  quality  of  the  approximation.  For  example,  consider  the  simple  case  where  we 
have  a  table  consisting  of  only  one  r.v.,  say  A,  representing  the  color  of  a  ball.  Let 
{red, green, yellow, blue, purple} be the possible colors and let the table be

$$
\begin{aligned}
P(A = \text{red}) &= 0.02, \\
P(A = \text{green}) &= 0.50, \\
P(A = \text{yellow}) &= 0.00, \\
P(A = \text{blue}) &= 0.37, \\
P(A = \text{purple}) &= 0.11.
\end{aligned}
$$

First, let’s simply map red to 1, green to 2, yellow to 3, etc. We get the resulting plot
in Fig. 2. Now, let’s choose a more intelligent mapping scheme as in Fig. 3. The
approximation function for our second mapping is simply a straight line through all
the points, as opposed to the awful polynomial interpolation curve we would need for
the first mapping. Thus, the problem also involves choosing the right mapping to help
ensure that we can find a function simple enough to reasonably approximate the
table. Once a suitable function is found, we can again throw away the table.

Clearly, the independence-based method is much simpler and easier to implement.
In fact, we only need to make a very minor modification to the reasoning
computations [14]. There is no loss of precision with this compaction scheme. However,
requiring identical probabilities in the entries is very restrictive. Hence, this
will generally result in relatively minor savings. With the “Noisy-OR” models, the
problem lies in successfully identifying a significant number of interactions.

LPFs, on the other hand, can provide tremendous reductions. Furthermore, there
is a closed form solution for computing the best one. However, we trade off these
savings against reduced accuracy since LPFs are generally only an approximation.³ Yet,
with the LPF as well, little change to the reasoning algorithms is necessary. Much
can be gained, especially when searching for the optimal value in the table. As we
mentioned above, most of the algorithms such as [18] must spend time searching
an unordered table. Since LPFs are real functions, we could potentially use simple
techniques such as taking the first derivative to search for the optimal value even


3  Although  we  can  generally  interpolate  all  the  points  perfectly  given  a  high  enough  degree  polynomial, 
such  a  function  would  be  much  too  complex  to  find  and  use. 
Fig. 2. Encoding 1 for A.
Fig. 3. Encoding 2 for A.
under various r.v. assignment restrictions. Furthermore, they can be directly used in
an integer linear programming algorithm for reasoning with Bayesian networks [8].

Aside  from  the  space  and  computational  aspects  of  the  methods  above,  the 
pattern  matching  approaches  of  IBMAPs  and  LPFs  can  also  help  deal  with  incomplete 
information.  One  of  the  drawbacks  of  using  Bayesian  networks  is  the  requirement 
that  the  conditional  tables  be  complete,  i.e.,  all  combinations  of  conditionals  must  be 
available.  Lacking  completeness,  computations  such  as  determining  the  most  probable 
joint distribution cannot be performed. However, we might make intelligent estimates
for these missing values via the patterns we’ve discovered using the above techniques.

In  this  paper,  we  look  further  at  pattern  matching  techniques  to  help  reduce  the 
table explosions. In particular, we look more closely at LPFs since the other methods
seem somewhat restrictive. In [8], only a single LPF is used for a given table. Here,
we  extend  the  results  to  splining  a  table  with  multiple  LPFs  to  increase  precision.  We 
look  at  the  complexity  issues  of  determining  these  multiple  LPFs  as  well  as  various 
special  algorithms. 
We begin in section 2 by studying single LPFs. In section 3, we present a
natural extension of single LPFs for estimating missing data. In section 4, we discuss
the problem of applying multiple LPFs and its complexity issues. And, in section 5,
we consider some special approaches to multiple LPFs called bisecting splines and
merging splines which are easier to work with.


2.  Single  LPFs 

Given a Bayesian network, we can construct an ordered pair $(V, P)$ where $V$
is the set of r.v.s in the network and $P$ is a set of conditional probabilities associated
with the network. $P(A = a \mid C_1 = c_1, \ldots, C_n = c_n) \in P$ iff $C_1, \ldots, C_n$ are all the
immediate parents of $A$ and there is an edge from $C_i$ to $A$ for $i = 1, \ldots, n$ in the
network. We can clearly see that $(V, P)$ completely describes the Bayesian network.

Notation. $\mathbb{R}$ denotes the real numbers. $\mathbb{R}^n$ denotes the cross product of $\mathbb{R}$ $n$ times.

Let $B$ be a Bayesian network and let $C_0$ be some r.v. in $B$ with associated
conditional table $T$. Assume the entries of $T$ are probabilities of the form

$$P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n)$$

and we would like to replace the table by an approximation function $S$. $S$ should be a
real-valued function from $\mathbb{R}^{n+1}$ to $\mathbb{R}$ with each argument corresponding to one of the $c_i$’s.
Unfortunately, as we pointed out earlier, instantiations for r.v.s need not be numeric
in nature. Hence, it is necessary to map them into real values for our purposes.

Notation.  Given  a  r.v.  A,  the  set  of  possible  values  for  A  called  the  range  of  A  will 
be  denoted  by  R(A). 

Definition 2.1. Given a Bayesian network $B = (V, P)$, an instantiation is an ordered
pair $(A, a)$ where $A \in V$ and $a \in R(A)$. (An instantiation $(A, a)$ is also denoted
by $A = a$ and Aa.) A collection of instantiations $w$ is called an instantiation-set iff

$(A, a), (A, a')$ in $w$ implies $a = a'$.

An  instantiation  represents  the  event  when  a  r.v.  takes  on  a  value  from  its 
range. 4 

Definition 2.2. Given a r.v. $A$ in $V$, a one-to-one and onto mapping $E_A$ from $R(A)$
to $\mathbb{R}$ is called an encoder for $A$.

4 We can use instantiations and assignments interchangeably; however, whereas instantiation refers strictly
to one r.v., assignment might refer to a collection of r.v.s.
Intuitively, encoders provide a total ordering on the possible instantiations for
a r.v. As we saw earlier in Figs. 2 and 3, there are good encoders and bad encoders.
Our problem is certainly easier if we just arbitrarily choose an encoder to work with.
However, we cannot guarantee finding the best LPF.

The heart of the formulation lies in the effective identification of encoders
and an approximation function simultaneously. In particular, we are interested in
approximation functions of the form

$$S(E_{C_0}(c_0), E_{C_1}(c_1), \ldots, E_{C_n}(c_n)) = e^{k_0 E_{C_0}(c_0) + k_1 E_{C_1}(c_1) + \cdots + k_n E_{C_n}(c_n) + k} \tag{1}$$

which are our linear potential functions (LPFs) [8]. Clearly LPFs are simple continuous
real functions, but their appeal lies in the following observation: We compute
joint probabilities by multiplying the various conditional probabilities together, such
as $p_1 p_2 \cdots p_k$ via the chain rule. If we take the natural logarithm, $\ln p_1 p_2 \cdots p_k =
\ln p_1 + \ln p_2 + \cdots + \ln p_k$, and then replace the $p_i$’s by LPFs, this reduces our
computation to a simple linear summation.
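Written out, the observation takes the following shape. The table index $t$, the argument counts $n_t$, and the superscripted constants $k^{(t)}_j$, $k^{(t)}$ are notation assumed here purely for illustration; they do not appear in the original.

```latex
% Assumption: the t-th factor p_t is replaced by an LPF S_t of the form (1),
% with its own constants k^{(t)}_j, k^{(t)} and n_t + 1 encoded arguments.
\ln \prod_{t=1}^{k} p_t
  \;\approx\; \ln \prod_{t=1}^{k} S_t
  \;=\; \sum_{t=1}^{k} \Bigl( \sum_{j=0}^{n_t} k^{(t)}_j \,
        E_{C^{(t)}_j}\bigl(c^{(t)}_j\bigr) + k^{(t)} \Bigr)
```

The right-hand side is linear in the encoded values, which is what makes the form convenient for linear techniques.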

Ideally, we would like $S$ to satisfy

$$e^{k_0 E_{C_0}(c_0) + k_1 E_{C_1}(c_1) + \cdots + k_n E_{C_n}(c_n) + k} = P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n),$$

which can be rewritten in a simpler form as

$$k_0 E_{C_0}(c_0) + k_1 E_{C_1}(c_1) + \cdots + k_n E_{C_n}(c_n) + k = \ln P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n).$$


As we can easily see, our goal is to determine the constants $k, k_0, k_1, \ldots, k_n$ and to
determine the mapping of each particular instantiation of a r.v. to some real number.
Hence, we have the following variables we must solve for:

• The constants $k, k_0, k_1, \ldots, k_n$.

• For each r.v. $C_i$, for each $c_i \in R(C_i)$, the encoding $E_{C_i}(c_i)$.

This can be accomplished by minimizing the following sum over the entries in table $T$:

$$\sum_{\substack{c_0 \in R(C_0) \\ \vdots \\ c_n \in R(C_n)}} \left[ \sum_{j=0}^{n} k_j E_{C_j}(c_j) + k - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \right]^2 \tag{2}$$


Obviously, this equation is some form of least-squares fit. We can find the minimum
by taking the partial derivatives of (2) with respect to all the variables we must
solve for and then setting these derivatives to 0. Unfortunately, the resulting system of
equations is quadratic, making it quite difficult to solve. However, there are certain
key observations about the minimization (see [8]) which allow the absorption of the
constants $k, k_0, \ldots, k_n$, resulting in a simpler least-squares fit as follows:


$$\sum_{\substack{c_0 \in R(C_0) \\ \vdots \\ c_n \in R(C_n)}} \left[ \sum_{j=0}^{n} E_{C_j}(c_j) - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \right]^2 \tag{3}$$

With  this  reduction,  we  have  the  following  closed  form  solution: 


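Theorem 2.1. The minimal solution to (3) will be, for each r.v. $C_k$ in $T$ and for each
$c_k \in R(C_k)$,

$$E_{C_k}(c_k) = \frac{\displaystyle\sum_{\substack{d_0 \in R(C_0) \\ \vdots \\ d_n \in R(C_n) \\ d_k = c_k}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots)}{\displaystyle\prod_{i \neq k} m_i} - \frac{n Z(T)}{(n+1) \prod_{i=0}^{n} m_i} \tag{4}$$

where $m_i = |R(C_i)|$ for $i = 0, \ldots, n$ and

$$Z(T) = \sum_{\substack{d_0 \in R(C_0) \\ \vdots \\ d_n \in R(C_n)}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots).$$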

Proof.  This  follows  from  substituting  (4)  into  the  partial  first  derivatives  of  (3)  and 
proving  they  are  equal  to  0.  □ 
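The closed form can be exercised numerically with a short Python sketch. It assumes a complete table stored as a dictionary with strictly positive entries (so that the logarithms exist); the function names and the 2 x 2 table are hypothetical.

```python
from itertools import product
from math import log, exp, prod

def fit_lpf(ranges, prob):
    """Closed-form encodings for a complete table.
    ranges: ranges[k] lists the possible values of C_k, k = 0..n.
    prob:   maps every assignment tuple (c_0, ..., c_n) to a strictly
            positive conditional probability.
    Returns enc with enc[k][c] = E_{C_k}(c)."""
    n = len(ranges) - 1
    sizes = [len(r) for r in ranges]
    total = prod(sizes)                                  # product of all m_i
    z = sum(log(prob[d]) for d in product(*ranges))      # Z(T)
    shift = n * z / ((n + 1) * total)
    enc = []
    for k, rk in enumerate(ranges):
        others = total // sizes[k]                       # product of m_i, i != k
        enc_k = {}
        for c in rk:
            s = sum(log(prob[d]) for d in product(*ranges) if d[k] == c)
            enc_k[c] = s / others - shift
        enc.append(enc_k)
    return enc

def lpf_value(enc, assignment):
    """S at an assignment: exp of the summed encodings."""
    return exp(sum(enc[k][c] for k, c in enumerate(assignment)))

# Hypothetical table P(A | B), used only for illustration.
ranges = [["a1", "a2"], ["b1", "b2"]]
prob = {("a1", "b1"): 0.2, ("a2", "b1"): 0.8,
        ("a1", "b2"): 0.6, ("a2", "b2"): 0.4}
enc = fit_lpf(ranges, prob)
residuals = [log(prob[d]) - log(lpf_value(enc, d)) for d in prob]
```

The sketch recomputes each partial sum by scanning the whole table, which keeps it close to the formula; a single pass that accumulates all sums at once would be the natural optimization.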


Furthermore, we can determine necessary and sufficient conditions for a perfect
LPF. Our best approximation function will be a perfect fit if and only if

$$\sum_{j=0}^{n} E_{C_j}(c_j) = \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \tag{5}$$

for all $c_j \in R(C_j)$ for $j = 0, \ldots, n$.

Theorem 2.2. There is a perfect fit if and only if

$$\sum_{j=0}^{n} \Bigg[ m_j \sum_{\substack{c'_0 \in R(C_0) \\ \vdots \\ c'_n \in R(C_n) \\ c'_j = c_j}} \ln P(C_0 = c'_0 \mid C_1 = c'_1, \ldots, C_j = c_j, \ldots, C_n = c'_n) \Bigg] - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \prod_{i=0}^{n} m_i = n Z(T) \tag{6}$$

holds for all $c_i \in R(C_i)$ for $i = 0, \ldots, n$.
Proof. This follows from substituting (6) into (5). □

From  this  theorem,  we  can  prove  the  following: 

Corollary  2.3.  Given  a  table  T  consisting  of  only  one  r.v.  and  either  all  entry  values 
are  distinct  or  identical,  the  best  LPF  S  over  T  will  be  a  perfect  fit. 

Although  this  corollary  posits  that  entries  must  be  all  distinct  or  identical,  we 
can  still  satisfy  this  condition  by  simply  perturbing  the  values  slightly. 

As we can easily see, Theorems 2.1 and 2.2 allow us to compute LPFs easily
and effectively via the closed form solutions.

Finally, let’s assume that some set $\{A_1, A_2, \ldots, A_n\}$ of r.v.s in our given
Bayesian network are now replaced by their respective LPFs $\{S_1, S_2, \ldots, S_n\}$. We
now consider the overall expected error of computing a joint probability when using
our LPFs. Again, since multiplying the conditional probabilities is equivalent to taking
the sum of the logarithms, the expected error is as follows:

$$\sum_{i=1}^{n} \sum_{w \in T_i} v_i \left( \ln P_i(w) - \ln \hat{P}_i(w) \right), \tag{7}$$

where $v_i = 1/|T_i|$, $|T_i|$ is the number of entries in the table $T_i$ associated with $A_i$, $P_i$ is
the conditional probability for $A_i$, and $\hat{P}_i(w)$ is the approximation via $S_i$.

Theorem 2.4.

$$\sum_{i=1}^{n} \sum_{w \in T_i} v_i \left( \ln P_i(w) - \ln \hat{P}_i(w) \right) = 0.$$

Proof.  This  follows  from  substituting  in  the  optimal  encoder  values  into  (7).  □ 
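For a single complete table $T$, the substitution can be spelled out as below. This is a sketch under the assumption that the LPF was built from the optimal encodings (4) of Theorem 2.1, with $\hat{P}$ standing for the LPF approximation and $w = (c_0, \ldots, c_n)$ ranging over the entries of $T$.

```latex
% Each value c_k occurs in \prod_{i \neq k} m_i entries of the table.
\sum_{w \in T} \ln \hat{P}(w)
  = \sum_{k=0}^{n} \Bigl( \prod_{i \neq k} m_i \Bigr)
    \sum_{c_k \in R(C_k)} E_{C_k}(c_k)
% Summing (4) over c_k gives
%   Z(T) / \prod_{i \neq k} m_i  -  n Z(T) / ((n+1) \prod_{i \neq k} m_i).
  = \sum_{k=0}^{n} \Bigl( Z(T) - \frac{n}{n+1}\, Z(T) \Bigr)
  = (n+1) \cdot \frac{Z(T)}{n+1}
  = Z(T)
  = \sum_{w \in T} \ln P(w)
```

Each table therefore contributes zero to the sum in (7), whatever the weights $v_i$.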


3.  Incompleteness 

As  we  mentioned  earlier,  Bayesian  networks  require  that  the  conditional  tables 
be  complete  with  respect  to  all  the  possible  combinations  of  assignments.  Without  this 
condition,  it  becomes  impossible  to  compute  various  joint  probabilities.  Unfortunately, 
there  are  cases  when  some  of  the  conditional  probabilities  may  not  be  obtainable  due 
to  lack  of  information. 

Now, let’s briefly consider how we might deal with incompleteness using our
LPF approach. The problem then becomes: how do we approximate through a
conditional table with missing entries? Clearly, the notion of approximating through the
existing probabilities is easy to see. Once we have determined such an LPF, the pattern
it captures will allow us to estimate the missing value by substituting the corresponding
r.v. assignments into the LPF.
Our new problem formulation is as follows: Let $T'$ be an incomplete conditional
table for $C_0$. We say that $(c_0, c_1, \ldots, c_n) \in T'$ if the associated conditional
$P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n)$ is available in $T'$. Our goal is to minimize the
following sum over the available entries in $T'$:

$$\sum_{(c_0, \ldots, c_n) \in T'} \left[ \sum_{j=0}^{n} E_{C_j}(c_j) - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \right]^2 \tag{8}$$


The optimal solution to (8) can be determined by solving the following system
of linear equations:

$$m_{C_k}(c_k)\, E_{C_k}(c_k) + \delta_{C_k}(T', c_k) + \sum_{\substack{j=0 \\ j \neq k}}^{n} \sum_{d_j \in R(C_j)} m_{C_k, C_j}(c_k, d_j)\, E_{C_j}(d_j) = 0,$$

where $m_{C_k}(c_k) = |\{(d_0, \ldots, d_n) \in T' \mid d_k = c_k\}|$,

$m_{C_k, C_j}(c_k, c_j) = |\{(d_0, \ldots, d_n) \in T' \mid d_k = c_k \text{ and } d_j = c_j\}|$, and

$$\delta_{C_k}(T', c_k) = - \sum_{\substack{(d_0, \ldots, d_n) \in T' \\ d_k = c_k}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots)$$

for all $c_k \in R(C_k)$, $k = 0, \ldots, n$.

Instead of performing the above computations, we can modify our closed form
solution (4) for the complete table to get encodings for $T'$. The basic idea is to drop
terms in the summations involving the missing entries. Also, we must modify our
constants a little to accurately reflect the number of entries being used.

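For each r.v. $C_k$ in $T'$, and for each $c_k \in R(C_k)$,

$$E_{C_k}(c_k) = \frac{1}{m_{C_k}(c_k)} \sum_{\substack{(d_0, \ldots, d_n) \in T' \\ d_k = c_k}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) - \frac{n Z(T')}{(n+1)\,|T'|} \tag{9}$$

where

$$m_{C_k}(c_k) = |\{(d_0, \ldots, d_n) \in T' \mid d_k = c_k\}|$$

and

$$Z(T') = \sum_{(c_0, \ldots, c_n) \in T'} \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots).$$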
These encodings are not necessarily optimal; however, they can still provide an
adequate approximation for our problem.
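A minimal Python sketch of the modified closed form (9) follows. It assumes the incomplete table is a dictionary that simply omits the missing entries, that all stored probabilities are strictly positive, and that every value to be estimated appears in at least one available entry; the names and numbers are hypothetical.

```python
from math import log, exp

def fit_incomplete(available):
    """available maps assignment tuples (c_0, ..., c_n) to strictly positive
    probabilities; missing entries are absent from the dictionary.
    Returns enc with enc[k][c] = E_{C_k}(c) per the modified closed form."""
    n = len(next(iter(available))) - 1
    z = sum(log(p) for p in available.values())          # Z(T')
    shift = n * z / ((n + 1) * len(available))           # uses |T'|
    enc = [dict() for _ in range(n + 1)]
    for k in range(n + 1):
        sums, counts = {}, {}
        for d, p in available.items():
            sums[d[k]] = sums.get(d[k], 0.0) + log(p)
            counts[d[k]] = counts.get(d[k], 0) + 1       # m_{C_k}(c_k)
        for c in sums:
            enc[k][c] = sums[c] / counts[c] - shift
    return enc

def estimate(enc, assignment):
    """LPF value at an assignment, including ones missing from the table."""
    return exp(sum(enc[k][c] for k, c in enumerate(assignment)))

# Hypothetical table P(A | B) with the entry for (a2, b2) missing.
available = {("a1", "b1"): 0.2, ("a2", "b1"): 0.8, ("a1", "b2"): 0.6}
enc = fit_incomplete(available)
guess = estimate(enc, ("a2", "b2"))
```

Nothing in this sketch renormalizes the result, so an estimate is not guaranteed to lie in [0, 1] or to make a column sum to one; any such correction would be a separate step.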

To get an optimal solution with a closed form answer, we consider the following
optimization:

$$\sum_{(c_0, \ldots, c_n) \in T} \left[ \sum_{j=0}^{n} E_{C_j}(c_j) - g(c_0, \ldots, c_n) \right]^2, \tag{10}$$

where $g(c_0, \ldots, c_n) = \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots)$ if $(c_0, \ldots, c_n) \in T'$; otherwise
$g(c_0, \ldots, c_n)$ is a function on some set of neighboring entries of $(c_0, \ldots, c_n)$.


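Theorem 3.1. The minimal solution to (10) will be, for each r.v. $C_k$ in $T$ and for each
$c_k \in R(C_k)$,

$$E_{C_k}(c_k) = \frac{1}{\prod_{i \neq k} m_i} \sum_{\substack{(d_0, \ldots, d_n) \in T \\ d_k = c_k}} g(d_0, \ldots, d_n) - \frac{n Z(T)}{(n+1) \prod_{i=0}^{n} m_i}, \tag{11}$$

where $m_i = |R(C_i)|$ for $i = 0, \ldots, n$ and

$$Z(T) = \sum_{(d_0, \ldots, d_n) \in T} g(d_0, \ldots, d_n).$$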


(Proofs  can  be  found  in  Appendix  8.) 

Also, we can prove the following theorem about perfect fits under incompleteness
for (10):


Theorem 3.2. There is a perfect fit if and only if

$$\sum_{j=0}^{n} \Bigg[ m_j \sum_{\substack{(d_0, \ldots, d_n) \in T \\ d_j = c_j}} g(d_0, \ldots, d_n) \Bigg] - g(c_0, \ldots, c_n) \prod_{i=0}^{n} m_i = n Z(T) \tag{12}$$

holds for all $(c_0, \ldots, c_n) \in T$.

Proof. This immediately follows from (10) and Theorem 3.1. □


As we mentioned earlier, we can directly apply our LPFs in an integer linear
programming formulation for reasoning with Bayesian networks. This particular
approach also works naturally with the incomplete LPFs, requiring absolutely no change
in the formulation. (See [8] and section 7 for a description of the method.)

The methods for finding the optimal LPF as well as estimating incomplete
information are not restricted to Bayesian networks and their conditional probability tables.
In fact, they can be applied to any table of values in any domain. Hence, we have a
general scheme which can be used in many domains needing data compression and/or
dealing with incompleteness.
4. Multiple LPFs


Single  LPFs  work  best  when  there  is  an  overall  single  pattern  on  the  conditional 
table.  Unfortunately,  people  who  design  and  develop  the  knowledge  bases  (networks) 
can  axiomatize  the  domain  in  a  strange  fashion  resulting  in  somewhat  disjointed  tables 
made up of multiple patterns. For example, take the simple case where we have two
r.v.s $A$ and $B$ with $A$ conditionally dependent on $B$:


$$
\begin{array}{ll}
P(A = a_1 \mid B = b_1) = 0.10, & P(A = a_1 \mid B = b_2) = 0.45, \\
P(A = a_2 \mid B = b_1) = 0.15, & P(A = a_2 \mid B = b_2) = 0.30, \\
P(A = a_3 \mid B = b_1) = 0.30, & P(A = a_3 \mid B = b_2) = 0.15, \\
P(A = a_4 \mid B = b_1) = 0.45, & P(A = a_4 \mid B = b_2) = 0.10.
\end{array}
$$


In the case of $b_1$, our values are ascending. For $b_2$, they are descending. Here, we
have two distinct patterns which will be difficult to unify into a single LPF.

One possible solution is to use a simple process called clustering [6]. We
can always merge a set of r.v.s into a single r.v. and still preserve the probability
distribution of the given domain. Assume $P(B = b_1) = 0.2$ and $P(B = b_2) = 0.8$.
We can create a new r.v. $C$ as follows:

$$
\begin{array}{ll}
P(C = a_1 b_1) = 0.02, & P(C = a_1 b_2) = 0.36, \\
P(C = a_2 b_1) = 0.03, & P(C = a_2 b_2) = 0.24, \\
P(C = a_3 b_1) = 0.06, & P(C = a_3 b_2) = 0.12, \\
P(C = a_4 b_1) = 0.09, & P(C = a_4 b_2) = 0.08.
\end{array}
$$

Replacing $A$ and $B$ by $C$ and modifying the appropriate conditional links to point to
$C$ will not change the distribution of the domain. Clustering is used to reduce the
number of nodes in the network as well as to reduce the graph into simpler graphs
such as polytrees. However, as we can easily see, our table size has increased by
a multiplicative factor, which makes clustering expensive, especially for the
algorithms which need to search through this unordered table.
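The clustered table can be reproduced mechanically. The short Python sketch below multiplies each conditional $P(A = a \mid B = b)$ by the prior $P(B = b)$ given in the example; the dictionary names are illustrative.

```python
# Conditional table P(A | B) and prior P(B) from the example above.
p_a_given_b = {
    ("a1", "b1"): 0.10, ("a2", "b1"): 0.15, ("a3", "b1"): 0.30, ("a4", "b1"): 0.45,
    ("a1", "b2"): 0.45, ("a2", "b2"): 0.30, ("a3", "b2"): 0.15, ("a4", "b2"): 0.10,
}
p_b = {"b1": 0.2, "b2": 0.8}

# The clustered r.v. C ranges over pairs (a, b), with
# P(C = ab) = P(A = a | B = b) * P(B = b).
p_c = {(a, b): p * p_b[b] for (a, b), p in p_a_given_b.items()}

for (a, b), p in sorted(p_c.items()):
    print(f"P(C = {a}{b}) = {p:.2f}")
```

The range of $C$ has one value per pair $(a, b)$, so its size is the product of the sizes of the ranges being merged.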

On  the  positive  side,  since  we  have  reduced  ourselves  to  a  table  with  a  single 
r.v.,  we  are  guaranteed  by  Corollary  2.3  that  we  will  have  a  perfect  fit  for  our  LPF. 
Although  we  have  reduced  ourselves  to  a  single  perfect  LPF,  we  must  still  maintain  the 
encoding information for the r.v. assignment. Space-wise, we have only traded one set
of numbers (the probabilities in the table) for another set (the encodings). However,
this can still be an improvement over the old methods when combined with an integer
linear programming algorithm (see section 7). Mainly, we would avoid performing
a  complete  search  through  an  unordered  table.  The  transformation  via  the  encoding 
provides  such  an  ordering.  In  the  case  where  we  may  have  fixed  or  restricted  the 
possible  values  to  one  or  more  of  the  original  r.v.s  before  clustering  (such  as  evidence 
r.v.s),  determining  which  encodings  are  consistent  with  these  restrictions  can  be  easily 
accomplished  with  a  simple  cross-indexing  scheme  for  storing  the  encodings. 

Still, one of our goals is to reduce space consumption. Clustering is an incredibly
efficient way of swallowing up memory. Instead of clustering to solve our above
example, we can use two separate LPFs over the original table. One LPF will range over
the table with $B$ fixed to $b_1$ and the other with $B$ fixed to $b_2$. In particular, for this example,
the individual subtables really consist of only one r.v. whose value changes across the
entries. Thus, it follows from Corollary 2.3 that the LPFs will be perfect fits for their
respective subtables. As opposed to storing a multiplicative number of encodings via
the clustering method, we only need to keep an additive number. Clearly, this is a
major point in favor of using multiple LPFs, especially when we generalize to larger tables
involving several r.v.s. Also, they naturally give us better approximations compared
to single LPFs. In effect, we are splining the points in the table with multiple LPFs.
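The contrast between one LPF and two can be checked numerically on the example table. In the Python sketch below, each sub-table LPF uses the encoding $E(a) = \ln P(A = a \mid B = b)$, which is what the closed form reduces to for a table with a single varying r.v.; the single LPF is taken to be the least-squares additive fit in log space (row mean plus column mean minus grand mean), which is the two-r.v. case of the same closed form. Variable names are illustrative.

```python
from math import log, exp

rows = ["a1", "a2", "a3", "a4"]
table = {
    "b1": [0.10, 0.15, 0.30, 0.45],
    "b2": [0.45, 0.30, 0.15, 0.10],
}

# One LPF per sub-table: with B fixed, only A varies.
sub_enc = {b: {a: log(p) for a, p in zip(rows, ps)} for b, ps in table.items()}

def spline_lookup(a, b):
    """Pick the LPF by the value of B, then evaluate it at a."""
    return exp(sub_enc[b][a])

# Single LPF over the whole table, for comparison.
logs = {(a, b): log(p) for b, ps in table.items() for a, p in zip(rows, ps)}
grand = sum(logs.values()) / len(logs)
row_mean = {a: sum(logs[(a, b)] for b in table) / len(table) for a in rows}
col_mean = {b: sum(logs[(a, b)] for a in rows) / len(rows) for b in table}

single_err = sum((row_mean[a] + col_mean[b] - grand - logs[(a, b)]) ** 2
                 for (a, b) in logs)
split_err = sum((sub_enc[b][a] - logs[(a, b)]) ** 2 for (a, b) in logs)
print(single_err, split_err)
```

The squared log-error of the split representation is zero, while the single fit cannot follow both the ascending and the descending column at once.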

We  can  obviously  benefit  a  great  deal  from  multiple  LPFs  but  we  must  consider 
how to use them appropriately. We realize that there is a trade-off to using more
and  more  LPFs.  While  we  get  improved  approximations,  we  also  increase  space 
consumption  (although  this  is  additive  as  opposed  to  multiplicative)  and  will  likely 
increase  the  problems  of  trying  to  find  such  LPFs.  Another  critical  problem  will 
involve  determining  exactly  which  LPF  we  are  supposed  to  use  in  retrieving  a  particular 
conditional  probability. 

Let’s first address the critical issue of choosing the right LPF which is supposed
to “cover” a given entry of interest. If we simply allow our LPFs to arbitrarily approximate
collections of table entries, we are back to our original problem of not saving any
space. We would essentially have to keep a complete table of which entries belong
to which LPF regardless of how many LPFs are being used. However, by intelligently
partitioning the table and then using exactly one LPF on each partition, we can develop
an inexpensive method for determining which partition a particular entry resides in, which
of course also tells us which LPF to use.

Again, let T be a conditional table with probabilities
$P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n)$. Also, let $E_{C_k}$ be some encoder
for $C_k$ for $k = 0, \ldots, n$. We call the set $E(T) = \{E_{C_0}, \ldots, E_{C_n}\}$
an encoder-set for T.

Definition 4.1. Let $\rho(T) = \{\sigma_1, \ldots, \sigma_s\}$ be a partition on T and
$\mathcal{E}(T) = \{E_1(T), \ldots, E_s(T)\}$ be encoder-sets for T. We say that
$g(T) = (\rho(T), \mathcal{E}(T))$ is contiguous^5 if and only if for each cell
$\sigma_l$ in $\rho(T)$, there exist constants $\alpha_i^l$ and $\beta_i^l$ for
$i = 0, \ldots, n$ such that

$$\sigma_l = \{ P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n) \in T \mid
\alpha_i^l \le E^l_{C_i}(c_i) \le \beta_i^l \text{ for } i = 0, \ldots, n \}, \qquad (13)$$

where $E_l(T) = \{E^l_{C_0}, \ldots, E^l_{C_n}\}$. We call each $\sigma_l$ in $\rho(T)$
a sub-table of T.

Intuitively, we can view T as a multi-dimensional hypercube as follows: For
the sake of simplicity, let's assume that the encoder-sets $E_l(T)$ are all identical, that
is, $E_i(T) = E_j(T)$ for all i and j. For each $C_i$, we have uniquely numbered the
instantiations $R(C_i)$ via $E_{C_i}$. We now construct a hypercube whose axes correspond
to each r.v. $C_i$. We associate entry $P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n)$ with the
point in this hypercube at coordinates $(E_{C_0}(c_0), \ldots, E_{C_n}(c_n))$. The hypercube will
be bounded by the minimum and maximum values of the encoding on each $R(C_i)$.
A contiguous partition of T will simply cut up the hypercube into a set of smaller
hypercubes, each bounded by the $\alpha_i$'s and $\beta_i$'s.


5 A preliminary version of contiguity appears in [8]. This new definition generalizes [8] by incorporating
multiple encoders as well as multiple LPFs.

Since each mini-hypercube $\sigma_l \in \rho(T)$ will be bounded by a unique collection
of $\alpha_i^l$'s and $\beta_i^l$'s, these constants will serve as our mechanism to determine which
LPF an entry belongs to. Furthermore, this will correspond to a simple satisfiability
test on the collections of inequalities from (13) for each sub-table. Clearly, the number
of constants and inequalities involved is a small factor linear in n and s.
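As a rough illustration of the membership test just described, the following Python sketch assumes a single shared encoder-set, so that an entry is already given by its encoded coordinates. The function name `find_cell`, the cell labels, and the numeric bounds are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# One (alpha_i, beta_i) pair per r.v. C_i, as in the inequalities of (13).
Bounds = List[Tuple[float, float]]


def find_cell(coords: List[float], cells: Dict[str, Bounds]) -> Optional[str]:
    """Return the label of the sub-table whose bounds contain the encoded entry."""
    for label, bounds in cells.items():
        # The entry belongs to this cell if every coordinate lies in its interval.
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, bounds)):
            return label
    return None  # no cell covers the entry (should not happen for a partition)


# Hypothetical contiguous partition of a table over two r.v.s.
cells = {
    "sigma_1": [(1, 2), (1, 3)],
    "sigma_2": [(3, 4), (1, 3)],
}
```

Each lookup checks at most 2(n + 1) comparisons per cell, which matches the observation that the work is linear in n and s.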

Associating different encoder-sets to each $\sigma_l$ increases our flexibility. All that
happens now is that we still guarantee the bounding properties for each $\sigma_l$ via $E_l(T)$.
However, applying $E_l(T)$ to $\sigma_j$ when $l \ne j$ does not require this property. In order to
determine which hypercube an entry now belongs to, we simply make the appropriate
encodings from each $E_l(T)$ and run the inequalities test for $\sigma_l$ in particular. We are
guaranteed that the appropriate hypercube containing this entry can be determined
uniquely in this way.^6

Now,  with  a  partitioning  scheme  available,  we  next  consider  how  to  determine 
these  multiple  LPFs.  The  fundamental  difficulty  we  encountered  in  formulating  single 
LPFs  naturally  occurs  for  multiple  LPFs,  namely,  before  we  can  partition  a  table,  we 
must  have  an  encoding  and  vice  versa.  Again,  we  must  simultaneously  determine  the 
two  in  order  to  find  the  best  multiple  LPFs. 

Multiple LPFs can be of immediate benefit when the designer can already identify
patterns in the tables. We would then simply try to find the best LPFs using
the single LPF technique for each pattern/hypercube identified. Clearly, this sort of
situation is the best we could hope for in reducing our overall problem.

When such an identification is not readily available, we must then consider
the following question: What is the minimum number of LPFs that best fits a given
table? We know that if we decided to associate one LPF with each entry individually,
these $|R(C_0)||R(C_1)| \cdots |R(C_n)|$ LPFs will perfectly fit the table. Clearly though, this
answer is unsatisfactory. We can get a second, slightly tighter bound as follows:

Theorem 4.1. Let

$$|R(C_m)| = \max_{i = 0, \ldots, n} |R(C_i)|.$$

There exist $|R(C_0)| \cdots |R(C_{m-1})||R(C_{m+1})| \cdots |R(C_n)|$ LPFs that perfectly fit table
T with minor perturbations to T if necessary.


6 At the end of section 7, we provide a brief description of how we can model this task as integer linear
programming.

Proof. For each possible assignment
$\{C_0 = c_0, \ldots, C_{m-1} = c_{m-1}, C_{m+1} = c_{m+1}, \ldots, C_n = c_n\}$, construct a cell
whose entries are consistent with the assignment. Hence, this cell will only have entries
which vary over $C_m$. Since each cell now really consists of one r.v., Corollary 2.3
guarantees a perfect LPF for the cell. □


Basically,  we  can  construct  a  whole  collection  of  LPFs  that  effectively  only  have 
to  deal  with  one  r.v.  Cm  varying  its  assignment  through  the  cell.  Hence,  we  can  easily 
construct  a  perfect  LPF  for  each  partition. 
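To make the bound of Theorem 4.1 concrete, here is a small Python sketch that compares it with the one-LPF-per-entry count. The helper names and the example range sizes are illustrative assumptions only.

```python
from math import prod


def trivial_lpf_count(range_sizes):
    """One LPF per table entry: |R(C_0)| * ... * |R(C_n)|."""
    return prod(range_sizes)


def theorem_4_1_bound(range_sizes):
    """Drop the largest range |R(C_m)| from the product, as in Theorem 4.1."""
    return prod(range_sizes) // max(range_sizes)


# Hypothetical table over three r.v.s with 2, 3 and 4 possible values.
sizes = [2, 3, 4]
```

For these sizes the trivial count is 24 LPFs, while the bound of Theorem 4.1 is 6.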

Returning  to  the  general  question  of  determining  the  minimum  number  required, 
we  find  that  the  major  difficulty  will  be  in  attempting  to  examine  all  the  different 
possible  partitionings  of  the  table  and  whether  or  not  a  given  partitioning  permits  a 
set  of  good  LPFs. 

More formally, we have the following least-squares optimization problem with
constraints: Let each possible r.v.s assignment in T be represented by the n-tuple

$$(c_0, c_1, \ldots, c_n).$$

Definition 4.2. A characteristic function $\chi$ on T is a function mapping the n-tuple
assignments of T to {0, 1} such that $\chi$ can be decomposed into n projection functions
$\chi_{C_i}$ which map $R(C_i)$ to {0, 1} and

$$\chi(c_0, c_1, \ldots, c_n) = \chi_{C_0}(c_0)\,\chi_{C_1}(c_1) \cdots \chi_{C_n}(c_n).$$


Intuitively, a characteristic function for T can completely describe any hypercube
(sub-table) arising from a contiguous partitioning of T. A value of 1 indicates
that an entry belongs to the sub-table.
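The product form of Definition 4.2 can be sketched in Python as follows. This is only an illustration: the projections are represented as sets of allowed values (an assumption of this sketch), and the helper `covers_exactly_once` checks that a family of characteristic functions assigns every entry to exactly one hypercube.

```python
from itertools import product


def make_characteristic(projections):
    """projections[i] is the set of values of C_i for which chi_{C_i} equals 1."""
    def chi(assignment):
        result = 1
        for value, allowed in zip(assignment, projections):
            result *= 1 if value in allowed else 0  # chi_{C_i}(c_i)
        return result
    return chi


def covers_exactly_once(chis, ranges):
    """True if every assignment lies in exactly one of the described hypercubes."""
    return all(sum(chi(c) for chi in chis) == 1 for c in product(*ranges))


# Hypothetical table over A and B, split on A into two sub-tables.
ranges = [["a1", "a2", "a3"], ["b1", "b2"]]
chi_1 = make_characteristic([{"a1", "a2"}, {"b1", "b2"}])
chi_2 = make_characteristic([{"a3"}, {"b1", "b2"}])
```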

Our goal is to minimize the following sum:

$$\sum_{t=1}^{s} \sum_{(c_0, \ldots, c_n) \in T} \chi^t(c_0, \ldots, c_n)
\left[ \sum_{j=0}^{n} k_j^t E^t_{C_j}(c_j) + k^t - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \right]^2 \qquad (14)$$

under the constraint:

$$\sum_{t=1}^{s} \chi^t(c_0, \ldots, c_n) = 1 \quad \text{for all } (c_0, \ldots, c_n) \in T, \qquad (15)$$

where $\chi^t$ is the characteristic function for hypercube t. Note that this minimization
actually corresponds to the problem of finding the best s or fewer LPFs. A hypercube
with no entries is still a partition.

The variables we minimize over in (14) are

• The constants $k_0^t, \ldots, k_n^t, k^t$ for $t = 1, \ldots, s$.

• The encodings $E^t_{C_i}(c_i)$ for $i = 0, \ldots, n$ and $t = 1, \ldots, s$.

• The projections $\chi^t_{C_0}(c_0), \chi^t_{C_1}(c_1), \ldots, \chi^t_{C_n}(c_n)$ associated with characteristic
functions $\chi^t$ for $t = 1, \ldots, s$.

In order to find the minimal solution, we can fold into the objective function
the additional constraints of (15) and the projection functions' restriction to {0, 1} by
using Lagrange multipliers. We can rewrite the {0, 1} restriction mathematically as
the constraint:

$$\chi^t_{C_i}(c_i)^2 - \chi^t_{C_i}(c_i) = 0.$$

We begin by taking the partial derivatives of the new combined objective function
with respect to the above variables and Lagrange multipliers and setting them to 0.
By Lagrange's method, the minimal solution must satisfy the new equations.
Unfortunately, unlike our single LPF minimization, our objective function is no longer
quadratic in nature. The space of possible solutions includes all sorts of extreme
points such as local minima and saddle points. In fact, this space is extremely large
and covers just about any possible combination of partitions and encodings, as demonstrated
by the next theorem.


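Theorem 4.2. Given any hypercube partition $\rho(T) = \{\sigma_1, \ldots, \sigma_s\}$ and the following
encodings for $c_k \in R^t(C_k)$,

$$E^t_{C_k}(c_k) = \frac{1}{k_k^t} \left[ \left( \prod_{i=0,\, i \ne k}^{n} \frac{1}{m_i^t} \right)
\sum_{\substack{d_0 \in R^t(C_0), \ldots, d_n \in R^t(C_n) \\ d_k = c_k}}
\ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots)
- \frac{n\, \ell^t(T)}{(n+1) \prod_{r=0}^{n} m_r^t} \right] \qquad (16)$$

where $R^t(C_i) = \{c_i \in R(C_i) \mid \chi^t_{C_i}(c_i) = 1\}$, $m_i^t = |R^t(C_i)|$, and

$$\ell^t(T) = \sum_{(d_0, \ldots, d_n) \in \sigma_t} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots),$$

this is an extreme point for (14).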


To complete Theorem 4.2, we must consider those assignments in
$R(C_i) - R^t(C_i)$ for which we have not computed an encoding. To guarantee that our solution
is contiguous, all we need to do is make sure that these encodings are outside the
bounds imposed by the hypercube $\sigma_t$.

Hence, the traditional methods for solving this least-squares problem will be
difficult to apply successfully. In fact, the presence of all these local minima, maxima and
saddle points strongly suggests that this problem is combinatorial in nature.

In Appendix 5, we present a possible approach based on integer linear
programming for multiple LPFs.

5.  Bisecting  splines

As  we  have  seen  from  the  previous  section,  determining  the  globally  optimal 
partitioning  seems  to  be  quite  hard.  However,  we  must  realize  that  obtaining  the 
global  optimum  is  not  necessarily  required.  Our  original  goal  with  LPFs  is  to  ease  the 
computational  bottleneck  from  having  to  deal  with  large  amounts  of  data.  The  global 
optimum  simply  gives  us  our  ideal  approximation. 

Methods  which  give  us  good  solutions  that  may  not  necessarily  be  the  best  can 
be  of  great  benefit  especially  if  they  are  relatively  easy  to  compute.  In  this  section, 
we  consider  a  method  based  on  bisecting  splines. 

Clearly,  the  problem  of  determining  the  best  contiguous  partitioning  rests  on 
the  combinatorial  number  of  partitions  possible.  Hence,  we  would  like  to  restrict  the 
number  of  partitions  we  search  through.  Our  approach  is  an  iterative  scheme  which 
continuously  refines  our  splining  approximations. 

The bisecting splines method begins as follows: First, compute the best single
LPF over T as in section 2. Once we have this LPF, we then determine the entry in
T which is worst approximated by the single LPF. From the single LPF's encoders,
we can totally order the assignments in each $R(C_i)$. This ordering will help us to
break the table in n + 1 different ways into 2 sub-tables each. Basically, assume
$(d_0, \ldots, d_n)$ is the entry worst approximated. Now, for each r.v. $C_i$, partition $R(C_i)$
into two disjoint subsets: one subset containing all those assignments which precede
$d_i$ in the ordering, and the other subset containing those succeeding $d_i$, including $d_i$.

We construct our first partitioning using $R(C_0)$ via its subset split. The remaining
partitionings use $R(C_1), \ldots, R(C_n)$, respectively. Next, taking each 2 sub-table partitioning,
we consider the individual sub-tables and compute the best LPF spline over each. Hence,
each 2 sub-table partitioning will have 2 LPFs, giving us a contiguous partition.

We  now  choose  the  best  contiguous  partition  from  among  the  n  +  1  created. 
We  are  guaranteed  that  this  new  2-LPF  approximation  will  be  at  least  as  good  as  the 
original  single  LPF.  Basically,  if  the  single  LPF  had  been  the  best  possible,  then  the 
partitioning  into  2  sub-tables  would  not  have  had  any  effect. 

We continue to refine our approximations iteratively by now determining which
new entry is worst approximated by the 2 new splines. We then take the sub-table
containing this entry and break it further in n different ways into 2 sub-sub-tables,
so to speak. Again, we choose the best partitioning among these n and get a new
3-LPF approximation. As before, this 3-LPF approximation will be at least as good as
the 2-LPF. We can continue this process until we reach the extreme case of one LPF
for each entry in T. In essence, we are making successively finer and finer partitions.

Clearly,  the  bisecting  splines  method  is  relatively  easy  to  perform.  Furthermore, 
it  guarantees  better  approximations  as  we  continue  to  partition  the  tables. 
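One bisection step can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the paper's implementation: the table is stored as a dictionary from assignment tuples to ln P; each cell is a complete sub-table given by one list of values per r.v.; the "best single LPF" on a cell is taken to be the least-squares additive fit of ln P (row means minus the grand mean), which presumes the encodings are free to take any values; and each r.v.'s values are ordered by their fitted mean as a stand-in for the encoder ordering. All function names are hypothetical.

```python
from itertools import product


def additive_fit(cell, logp):
    """Least-squares additive fit of ln P on a complete sub-table."""
    entries = list(product(*cell))
    grand = sum(logp[e] for e in entries) / len(entries)
    means = []
    for axis, values in enumerate(cell):
        per_value = {}
        for v in values:
            rows = [logp[e] for e in entries if e[axis] == v]
            per_value[v] = sum(rows) / len(rows)
        means.append(per_value)

    def predict(e):
        return sum(m[e[a]] for a, m in enumerate(means)) - (len(cell) - 1) * grand

    return predict, means


def squared_error(cell, logp):
    predict, _ = additive_fit(cell, logp)
    return sum((predict(e) - logp[e]) ** 2 for e in product(*cell))


def bisect_once(cells, logp):
    """Split the cell holding the worst-approximated entry in the best way."""
    worst = None  # (error, cell index, entry, per-axis means)
    for idx, cell in enumerate(cells):
        predict, means = additive_fit(cell, logp)
        for e in product(*cell):
            err = abs(predict(e) - logp[e])
            if worst is None or err > worst[0]:
                worst = (err, idx, e, means)
    _, idx, entry, means = worst
    cell, best = cells[idx], None
    for axis, values in enumerate(cell):
        ordered = sorted(values, key=lambda v: means[axis][v])
        cut = ordered.index(entry[axis])
        if cut == 0:
            continue  # nothing precedes the worst entry on this axis
        halves = [cell[:axis] + [part] + cell[axis + 1:]
                  for part in (ordered[:cut], ordered[cut:])]
        total = sum(squared_error(h, logp) for h in halves)
        if best is None or total < best[0]:
            best = (total, halves)
    if best is None:
        return cells  # no admissible split was found
    return cells[:idx] + best[1] + cells[idx + 1:]


# Hypothetical 3 x 3 table: additive in ln P except for one perturbed entry.
logp = {(a, b): -(a + 1) - 0.5 * b for a in range(3) for b in range(3)}
logp[(1, 1)] -= 0.6
start = [[list(range(3)), list(range(3))]]
before = sum(squared_error(c, logp) for c in start)
after_cells = bisect_once(start, logp)
after = sum(squared_error(c, logp) for c in after_cells)
```

Because the single fit restricted to either half is itself an admissible fit for that half, refitting each half can never increase the total squared error, which mirrors the guarantee stated above.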

An alternative approach to bisecting splines is to start from the finest partition
and merge into successively larger partitions. In essence, we begin by associating an
LPF with every entry and construct coarser and coarser approximations. With each
iteration, we attempt to merge partitions in a manner which incurs the least amount
of error. We can stop when either we have a sufficiently small number of partitions
or some error threshold is exceeded.

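Notation. Given a cell $\sigma$ in some partition of T, we define

$$\mathrm{span}_{C_i}(\sigma) = \{c_i \in R(C_i) \mid \exists (d_0, \ldots, d_n) \in \sigma \text{ such that } c_i = d_i\}.$$

We say that a cell $\sigma$ is complete if

$$\forall (c_0, \ldots, c_n) \in \mathrm{span}_{C_0}(\sigma) \times \cdots \times \mathrm{span}_{C_n}(\sigma), \quad (c_0, \ldots, c_n) \in \sigma.$$

Given two cells $\sigma_1$ and $\sigma_2$, we say that they are mergeable if $\sigma_1 \cup \sigma_2$ is complete.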
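The span, completeness, and mergeability tests translate directly into code. In the following Python sketch, a cell is assumed to be a non-empty set of assignment tuples; the function names and sample cells are illustrative.

```python
from itertools import product


def span(cell, axis):
    """Values of the r.v. at position `axis` that occur somewhere in the cell."""
    return {entry[axis] for entry in cell}


def is_complete(cell):
    """A cell is complete if it contains the full cross product of its spans."""
    n_axes = len(next(iter(cell)))
    spans = [span(cell, a) for a in range(n_axes)]
    return all(e in cell for e in product(*spans))


def mergeable(cell_1, cell_2):
    """Two cells are mergeable if their union is complete."""
    return is_complete(cell_1 | cell_2)


# Hypothetical single-entry cells over r.v.s A and B.
s1 = {("a1", "b1")}
s2 = {("a1", "b2")}
s3 = {("a2", "b2")}
```

Here `s1` and `s2` merge into the complete cell {a1} x {b1, b2}, whereas `s1` and `s3` do not, since their union lacks (a1, b2) and (a2, b1).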

Our method begins as follows: Partition T so that each cell has exactly one
entry and let $\sigma_i^0$ represent such a cell. Next, arbitrarily choose some cell $\sigma_i^0$ and
K other cells $\{\sigma_{j_1}^0, \ldots, \sigma_{j_K}^0\}$ for some fixed constant K. If there does not exist a
mergeable $\sigma_j^0$ with $\sigma_i^0$, then try choosing a different set of K until one is found. For
each mergeable cell $\sigma_j^0$ with $\sigma_i^0$, compute the best single spline over the newly formed
hypercube. Choose the merging which introduces the least amount of error and form
a new partition on T consisting of the untouched cells along with the new merged
cell. Denote the cells in this new partition by $\sigma_i^1$.

In an iterative process, we now attempt to merge the $\sigma_i^1$'s. We start again by
arbitrarily choosing some cell $\sigma_i^1$.^7 Choose K other cells such that there is at least
one mergeable cell with $\sigma_i^1$. Again, perform the merging which introduces the least
error, creating a new partition $\sigma_i^2$.

We  are  guaranteed  that  we  will  always  generate  contiguous  hypercubes  using 
this  method.  As  well,  we  will  have  some  control  over  the  error  involved  with  coarser 
partitions. 

Like the bisecting splines method, merging splines is relatively easy to perform.
We can also combine the two techniques, giving us a mix of bottom-up
and top-down processing. Together, these methods provide us with viable approaches
for computing multiple splines.


6.  Results 

In  this  section,  we  now  perform  some  experiments  on  computing  multiple  LPFs. 
In  particular,  we  consider  the  bisecting  splines  algorithm  described  in  the  previous 
section  as  compared  to  the  single  LPF. 

There  are  several  measures  one  can  choose  to  determine  the  quality  of  the  LPF 
approximations.  For  our  experiments,  we  measured  the  approximations  against  the 
following  two  metrics: 


7 Instead of randomly choosing a cell, we could weight them according to the number of entries in each
cell. Our goal is to merge smaller cells together first in somewhat more of a bottom-up fashion.

Absolute worst fit:

$$\max_{c \in T} |P(c) - \hat{P}(c)|. \qquad (17)$$

Relative average fit:

$$\frac{1}{|T|} \sum_{c \in T} \frac{|P(c) - \hat{P}(c)|}{P(c)}, \qquad (18)$$

where T is the table, $P$ is the actual probability, and $\hat{P}$ is our approximated probability.
The relative fit will give us a percentage difference from the original table entry value.
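The two metrics can be computed as in the Python sketch below. It assumes the table and its approximation are dictionaries keyed by entry, and that (18) averages over the |T| entries of the table; the function names and sample values are hypothetical.

```python
def absolute_worst_fit(actual, approx):
    """Metric (17): largest absolute deviation over all table entries."""
    return max(abs(actual[c] - approx[c]) for c in actual)


def relative_average_fit(actual, approx):
    """Metric (18): mean of |P - P_hat| / P over all table entries."""
    total = sum(abs(actual[c] - approx[c]) / actual[c] for c in actual)
    return total / len(actual)


# Hypothetical two-entry table and its approximation.
actual = {"x": 0.5, "y": 0.25}
approx = {"x": 0.45, "y": 0.30}
```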

We  randomly  generated  conditional  probability  tables  which  varied  in  size  from 
100,000  entries  to  3,200,000  entries  and  also  varied  the  number  of  r.v.s  found  in  the 
tables.  In  addition,  we  also  generated  tables  ranging  from  uniform  distributions  to 
highly  skewed  distributions. 

Clearly, our approximation approach should fare worst when faced with completely
random tables. By their very nature, we do not expect to find any patterns
or structures in them. We do realize, though, that most of the tables used in practice will by
necessity be structured in some form.

Instead of generating completely random tables, we modified the generation slightly to permit
some structure. We began by arbitrarily ordering the possible assignments for each
r.v. and then generated random values in the table with the following property: Let $a_1$
and $a_2$ be some instantiations for r.v. A, with $a_1$ preceding $a_2$ in the ordering. For any two
entries in the table which share exactly the same instantiations to all the r.v.s except for A,
and in which A is either $a_1$ or $a_2$, the entry with $a_1$ should have a value less than that
with $a_2$. Note, however, that after we normalize the table, this property may no longer hold.
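The paper does not spell out its generator, so the following Python sketch shows only one possible way to obtain tables with the stated ordering property: every raw value is a sum of strictly positive random numbers over all componentwise-smaller index tuples. The function names, the seed, and the choice of treating the first r.v. as $C_0$ during normalization are assumptions of this sketch.

```python
import random
from itertools import product


def monotone_raw_table(sizes, seed=0):
    """Raw values that strictly increase along every axis (before normalizing)."""
    rng = random.Random(seed)
    index = list(product(*(range(s) for s in sizes)))
    positive = {c: rng.random() + 1e-9 for c in index}
    # Multi-dimensional prefix sum of positive numbers is strictly monotone.
    return {c: sum(val for d, val in positive.items()
                   if all(di <= ci for di, ci in zip(d, c)))
            for c in index}


def normalize(raw, sizes):
    """Make the values over C_0 sum to 1 for every assignment of the other r.v.s."""
    table = {}
    for parents in product(*(range(s) for s in sizes[1:])):
        total = sum(raw[(c0,) + parents] for c0 in range(sizes[0]))
        for c0 in range(sizes[0]):
            table[(c0,) + parents] = raw[(c0,) + parents] / total
    return table


sizes = (3, 2, 2)
raw = monotone_raw_table(sizes)
table = normalize(raw, sizes)
```

After normalization the ordering still holds along $C_0$ (each column is divided by a single constant) but need not hold along the other r.v.s, in line with the remark above.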

We  began  by  taking  these  tables  and  fitting  them  with  a  single  LPF.  Our  results 
measured  using  both  absolute  worst  fit  and  relative  average  fit  are  summarized  in 
Tables  1  and  2,  respectively. 

With these tables, we performed the bisecting splines algorithm to generate a
2-LPF approximation. Our results are summarized in Tables 3 and 4.


Table 1
Single LPF absolute worst fit.

Entries   100,000       200,000       400,000       800,000       1,600,000     3,200,000
# r.v.s   3-12          3-12          3-12          3-12          3-12          3-12
Min       1.13 x 10^-3  1.21 x 10^-3  7.97 x 10^-4  6.54 x 10^-4  5.17 x 10^-4  4.52 x 10^-4
Max       4.98 x 10^-3  4.13 x 10^-3  3.46 x 10^-3  3.49 x 10^-3  3.35 x 10^-3  3.59 x 10^-3
Avg.      3.00 x 10^-3  2.89 x 10^-3  2.52 x 10^-3  2.51 x 10^-3  2.40 x 10^-3  2.34 x 10^-3

Table 2
Single LPF relative average fit.

Entries   100,000       200,000       400,000       800,000       1,600,000     3,200,000
# r.v.s   3-12          3-12          3-12          3-12          3-12          3-12
Min       4.25 x 10^-4  4.08 x 10^-4  2.70 x 10^-4  2.46 x 10^-4  2.08 x 10^-4  1.85 x 10^-4
Max       1.67 x 10^-3  1.43 x 10^-3  1.16 x 10^-3  1.08 x 10^-3  1.00 x 10^-3  9.94 x 10^-4
Avg.      9.22 x 10^-4  8.22 x 10^-4  7.83 x 10^-4  7.25 x 10^-4  6.96 x 10^-4  6.96 x 10^-4

Table 3
Bisecting spline algorithm absolute worst fit improvement.

Entries   100,000       200,000       400,000       800,000       1,600,000     3,200,000
Min       90.19%        44.22%        83.83%        34.80%        56.24%        20.65%
Max       31772.00%     39570.20%     47397.10%     29167.50%     28701.60%     19788.30%
Avg.      2591.38%      1838.02%      1886.17%      1651.69%      1578.91%      1146.99%

Table 4
Bisecting spline algorithm relative average fit improvement.

Entries   100,000       200,000       400,000       800,000       1,600,000     3,200,000
Min       16.24%        11.28%        11.65%        12.42%        8.02%         5.86%
Max       3205.82%      4257.28%      5882.37%      3689.36%      5194.37%      6075.96%
Avg.      562.08%       600.57%       567.58%       382.36%       508.41%       497.53%

7.  LPFs,  belief  revision,  and  ILP 


We now present how LPFs (single as well as multiple) can be merged with
integer linear programming for reasoning with Bayesian networks.^8 In particular, we
consider belief revision, which can be used for modeling explanatory/diagnostic tasks.

In belief revision, some evidence or observation is given to us, and our task
is to come up with a set of hypotheses that together constitute the most satisfactory
explanation/interpretation of the evidence at hand. More formally, if W is the set
of all r.v.s in our given Bayesian network and e is our given evidence,^9 any complete
instantiation of all the r.v.s in W which is consistent with e will be called an
explanation or interpretation of e. Our problem is to find an explanation $w^*$ such that

$$P(w^* \mid e) = \max_{w} P(w \mid e). \qquad (19)$$

Intuitively, we can think of the non-evidence r.v.s in W as possible hypotheses for e.
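For very small networks, (19) can be solved by brute-force enumeration, which gives a useful reference point for the integer linear programming formulation. The Python sketch below relies on the fact that, for explanations consistent with e, P(w | e) = P(w) / P(e), so maximizing the joint probability is sufficient. The two-variable network and its numbers are hypothetical.

```python
from itertools import product


def most_probable_explanation(ranges, joint, evidence):
    """Enumerate complete instantiations consistent with the evidence."""
    names = list(ranges)
    best, best_p = None, -1.0
    for values in product(*(ranges[n] for n in names)):
        w = dict(zip(names, values))
        if any(w[k] != v for k, v in evidence.items()):
            continue  # inconsistent with e, so not an explanation
        p = joint(w)  # proportional to P(w | e) because P(e) is constant
        if p > best_p:
            best, best_p = w, p
    return best, best_p


# Hypothetical network Rain -> Wet with made-up probabilities.
ranges = {"Rain": [True, False], "Wet": [True, False]}


def joint(w):
    p_rain = 0.2 if w["Rain"] else 0.8
    p_wet_given_rain = 0.9 if w["Rain"] else 0.1
    p_wet = p_wet_given_rain if w["Wet"] else 1.0 - p_wet_given_rain
    return p_rain * p_wet


best, best_p = most_probable_explanation(ranges, joint, {"Wet": True})
```

The enumeration is exponential in the number of r.v.s, which is exactly the cost the constraint-based formulation aims to avoid.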

Solving this using integer linear programming involves mapping the r.v. instantiations
into some multi-dimensional space which we will denote by $\Re^n$. A subspace
of $\Re^n$ will represent "valid" instantiations, where valid includes things like being
consistent with the given evidence e, each r.v. having at most one instantiation, etc. In particular,
we are interested in transforming it into a polyhedral convex set.^10 Such a set can be
described by a collection of linear inequalities. As it turns out, these inequalities will
intuitively correspond to the restrictions/constraints required in making valid instantiations
of the r.v.s. Finally, we would like to define a linear energy function such that
by minimizing it over the convex set, the resulting answer will be the most probable
solution.


8 See [8] for additional details.

9 That is, e represents a set of instantiations made on a subset of W.

10 "Polyhedral" refers to the fact that the boundaries of the subspace are composed of hyperplanes.

Notation. Throughout the remainder of this paper, upper case italicized letters such
as A, B, ... will represent r.v.s and lower case italicized letters such as a, b, ... will
represent the possible assignments to the associated upper case letter r.v., in this case,
A, B, ... . Subscripted upper case letters which are not italicized are variables in a
constraint system which explicitly represent the instantiation of the associated r.v. with
the item in the subscript. For example, $\mathrm{A}_a$ denotes the instantiation of r.v. A with
value a.

Definition 7.1. Given an instantiation-set w for a Bayesian network B = (V, P), we
define the span of w, span(w), to be the collection of r.v.s in the first coordinate
of the instantiations. Furthermore, an instantiation-set w is said to be complete iff
span(w) = V.

Notation. For each r.v. A, we define cond(A) as follows: $B \in \mathrm{cond}(A)$ iff there exists
a conditional probability in P of the form $P(A = a \mid \ldots, B = b, \ldots)$.

Notation. Given an instantiation-set w for B such that $\mathrm{cond}(A) \subseteq \mathrm{span}(w)$, w(A)
denotes the instantiation A = a where $(A, a) \in w$.

$$w(\mathrm{cond}(A)) = \{w(B) \mid B \in \mathrm{cond}(A)\}.$$

Let $\mathcal{E}_B$ be a collection of encoders for the r.v.s in V such that there is exactly
one encoder for each r.v.

Notation. Let T be a collection of conditional probabilities from P.

$$\mathrm{rv}(T) = \{A \mid P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n) \in T \text{ and } C_i = A \text{ for some } i\}.$$

Definition 7.2. Let T be a conditional table in B and assume

$$\mathrm{rv}(T) = \{C_0, C_1, \ldots, C_n\}.$$

Let $\rho(T)$ be a partition on T. We say that $\rho(T)$ is contiguous with respect to $\mathcal{E}_B$ if
and only if for each cell $\sigma$ in $\rho(T)$, there exist $\alpha_i$, $\beta_i$ for $i = 0, \ldots, n$ such that

$$\sigma = \{P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n) \in T \mid
\alpha_i \le E_{C_i}(c_i) \le \beta_i \text{ for } i = 0, \ldots, n\}.$$

We call each $\sigma$ in $\rho(T)$ a sub-table of T.

(Note: This is a simpler definition of contiguity than Definition 4.1. Here, we only
consider a single encoder-set for ease of discussion. At the end of this section, we
will briefly consider multiple encoders.)

Definition 7.3. Let $\rho(T)$ be a contiguous partition with respect to $\mathcal{E}_B$. For each cell
$\sigma$ in $\rho(T)$, we associate a LPF $S_{\mathcal{E}_B,\sigma}$ from $\Re^{n+1}$ to $\Re$ such that

$$S_{\mathcal{E}_B,\sigma}(E_{C_0}(c_0), E_{C_1}(c_1), \ldots, E_{C_n}(c_n)) = P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n), \qquad (20)$$

where $E_{C_0}, E_{C_1}, \ldots, E_{C_n}$ are the encoders found in $\mathcal{E}_B$ and

$$c_0 \in R(C_0), \ldots, c_n \in R(C_n).$$

We call $S_{\mathcal{E}_B,\sigma}$ a LPF for $\sigma$. Let $S_{\mathcal{E}_B,\rho(T)}$ be a collection of LPFs associated with partition
$\rho(T)$. We call $S_{\mathcal{E}_B,\rho(T)}$ a LPF-set for T.

Let $S_{\mathcal{E}_B}$ be a space of LPF-sets, also called a linear potential space, such that
each table T in B is associated with exactly one LPF-set.

Notation. Given $S_{\mathcal{E}_B,\rho(T)}$, for any $c_0 \in R(C_0), \ldots, c_n \in R(C_n)$,

$$S_{\mathcal{E}_B,\rho(T)}(E_{C_0}(c_0), \ldots, E_{C_n}(c_n))$$

will unambiguously refer to the appropriate LPF defined in $S_{\mathcal{E}_B,\rho(T)}$.

Given $S_{\mathcal{E}_B}$, since all our conditional probability tables are unique, for any
$c_0 \in R(C_0), \ldots, c_n \in R(C_n)$,

$$S_{\mathcal{E}_B}(E_{C_0}(c_0), \ldots, E_{C_n}(c_n))$$

will unambiguously refer to the appropriate LPF defined in $S_{\mathcal{E}_B}$.

We now redefine our notion of a Bayesian network.

Definition 7.4. Given a Bayesian network B = (V, P), let $\mathcal{E}_B$ be a collection of
encoders and $S_{\mathcal{E}_B}$ be a collection of spline-sets. We define a splined Bayesian network
to be a 3-tuple $B = (V, \mathcal{E}_B, S_{\mathcal{E}_B})$.

Definition 7.5. Given a complete instantiation-set w for B, we define the spline
probability $P_S$ for B as

$$P_S(w) = \prod_{A \in \mathrm{span}(w)} S_{\mathcal{E}_B}(w(A), w(\mathrm{cond}(A))). \qquad (21)$$


Theorem 7.1. Any Bayesian network can be modeled as a splined Bayesian network.

We now proceed to show how we can transform splined Bayesian networks
into linear constraint satisfaction problems.

Clearly, a splined Bayesian network $B = (V, \mathcal{E}_B, S_{\mathcal{E}_B})$ induces a partition on
the original conditional probabilities P. Let $[P]_B$ denote this induced partitioning.
Furthermore, a LPF from $S_{\mathcal{E}_B}$ is uniquely associated with each cell in the partition. If
d is a cell in $[P]_B$, then $S_d$ will uniquely denote the appropriate LPF.

We say that a r.v. instantiation {A = a} appears in a cell in $[P]_B$ if there exists
a conditional probability in the cell of the form $P(C_0 = c_0 \mid C_1 = c_1, \ldots, C_n = c_n)$
where for some $i = 1, \ldots, n$, $C_i = A$ and $c_i = a$.


Definition 7.6. Given a r.v. A and a splined Bayesian network B, we define the
partition on R(A) induced by B as follows: $a_1, a_2 \in R(A)$ both belong in the same
partition if and only if for all cells d in $[P]_B$, $\{A = a_1\}$ appears in d if and only
if $\{A = a_2\}$ appears in d. We call this partitioning the capsulation of A by B and
denote it by $[A]_B$.

Theorem 7.2. Given any cell $d = \{a_1, \ldots, a_k\}$ in $[A]_B$ where $E_A(a_i) < E_A(a_{i+1})$,
there does not exist $b \in R(A)$ such that $E_A(a_1) < E_A(b) < E_A(a_k)$ and $b \ne a_i$ for
$i = 1, \ldots, k$.

Without  going  into  detail,  we  must  generalize  our  notion  of  constraint  systems. 
Previously  for  belief  revision,  we  restricted  our  variables  to  values  of  0  and  1.  We 
generalize  this  by  allowing  variables  to  be  restricted  to  sets  of  real  values.  In  doing  so, 
we  also  generalize  our  notion  of  a  0-1  solution  to  the  notion  of  a  permissible  solution 
as  a  solution  which  satisfies  all  the  value  restrictions  as  well  as  constraints. 

We begin our construction as follows: Let $B = (V, \mathcal{E}_B, S_{\mathcal{E}_B})$ be a splined
Bayesian network and $S_{\mathcal{E}_B}$ be a linear potential space. For each r.v. A in V, construct
a real variable $x_A$ whose values are restricted to $\{E_A(a) \mid E_A \in \mathcal{E}_B \text{ and } a \in R(A)\}$.
Next, arbitrarily label all the cells in $[A]_B$ by

$$\{d_{A,1}, d_{A,2}, \ldots, d_{A,n}\}.$$

For each cell in $\{d_{A,1}, d_{A,2}, \ldots, d_{A,n}\}$, construct a new 0-1 restricted variable $x_{d_{A,i}}$.

Construct the following constraints:

$$\sum_{i=1}^{n} x_{d_{A,i}} = 1, \qquad (22)$$

and, for each $d_{A,i}$,

$$x_A - \min_{a \in d_{A,i}} E_A(a) \ge -M(1 - x_{d_{A,i}}), \qquad (23)$$

$$x_A - \max_{a \in d_{A,i}} E_A(a) \le M(1 - x_{d_{A,i}}), \qquad (24)$$

where M is some arbitrarily large positive constant. Intuitively, $x_{d_{A,i}}$ is used to
"detect" whether $x_A$ falls within a particular interval defined by the partitioning and
$x_A$ represents the instantiated value, if any.
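The role of the big-M constant in (22)-(24) can be checked with a short Python sketch. It assumes each cell $d_{A,i}$ is summarized by the minimum and maximum encoded value it contains, and it only verifies a candidate assignment rather than solving the program; the function name, the intervals, and the value of M are hypothetical.

```python
def satisfies_interval_constraints(x_a, detectors, intervals, big_m=1e6):
    """Check constraints (22)-(24) for one r.v. A.

    detectors[i] is the 0-1 value of x_{d_{A,i}};
    intervals[i] is (min E_A(a), max E_A(a)) over the cell d_{A,i}.
    """
    if sum(detectors) != 1:  # (22): exactly one detector is switched on
        return False
    for d, (lo, hi) in zip(detectors, intervals):
        if x_a - lo < -big_m * (1 - d):  # violates (23)
            return False
        if x_a - hi > big_m * (1 - d):  # violates (24)
            return False
    return True


# Hypothetical capsulation of A into two cells with encoded ranges [1, 2] and [3, 5].
intervals = [(1, 2), (3, 5)]
```

When a detector is 0, both inequalities are slack because M dominates any encoded value; when it is 1, they force $x_A$ into that cell's interval.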

From our definitions above, we can unambiguously associate a LPF S in $S_{\mathcal{E}_B}$
with some table T. Furthermore, using our encoder mappings, we describe the domain
of S as a cross product of intervals on real variables associated with the r.v.s in T. For
example, let $\rho(T)$ be the partition on T associated with S. According to Definition 7.3,
for each r.v. B in rv(T), there exists a set of instantiations of B,
$\{B = b_{i_1}, \ldots, B = b_{i_k}\}$, in the cell associated with S in $\rho(T)$ such that there does not
exist an instantiation {B = c} where $c \ne b_{i_j}$ for all $i_j$ and $E_B(c)$ is in the interval from
$\min_j E_B(b_{i_j})$ to $\max_j E_B(b_{i_j})$. Thus, we only need the min and the max values to describe the
domain for S. Also, remember that we want to incorporate S into our probabilistic
computations only when the r.v.s are instantiated within its restricted domain. Let
us denote the interval for a r.v. A for LPF S by [min(A, S), max(A, S)]. We now
proceed with our construction. For each LPF S involving the r.v.s $\{C_1, \ldots, C_n\}$, we
construct the new variables $x_{C_i,S}$ which are virtual copies of the $x_{C_i}$ constructed above.
For each $[\min(C_i, S), \max(C_i, S)]$, if there does not yet exist a 0-1 variable which
detects whether $x_{C_i}$ is in $[\min(C_i, S), \max(C_i, S)]$, create a new 0-1 variable $d_{C_i,S}$
with the following constraints: For each $d_{A,i}$,

$$x_{C_i} - \min(C_i, S) \ge -M(1 - d_{C_i,S}), \qquad (25)$$

$$x_{C_i} - \max(C_i, S) \le M(1 - d_{C_i,S}). \qquad (26)$$

Now, continue and construct a new 0-1 variable $d_S$, adding the following
constraints:

$$d_S \ge \sum_{j=1}^{n} d_{C_j,S} - n + 1, \qquad (27)$$

$$d_S \le \frac{1}{n} \sum_{j=1}^{n} d_{C_j,S}. \qquad (28)$$

For each i,

$$x_{C_i,S} \ge x_{C_i} - M(1 - d_S), \qquad (29)$$

$$x_{C_i,S} \le x_{C_i} + M(1 - d_S), \qquad (30)$$

$$x_{C_i,S} \le M d_S. \qquad (31)$$

$d_S$ is used to indicate whether S will be used. If so, copy the values into the virtual
copies and make the computations necessary based on the virtual copies.

We complete our transformation by defining an appropriate objective function.
For each LPF $S$, introduce the following terms into the objective function.
Assume that

$$S(E_A(a), E_{C_1}(c_1), \ldots, E_{C_n}(c_n)) = e^{k_0 E_A(a) + k_1 E_{C_1}(c_1) + \cdots + k_n E_{C_n}(c_n) + k}.$$

The terms to be added are

$$-k\,d_S - \sum_{i=1}^{n} k_i\, x_{C_i,S}. \tag{32}$$
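The relationship between an LPF value and its objective terms can be illustrated numerically. This is a small sketch under the assumption that all coefficients of the LPF are gathered in one list; the coefficient values and function names are made up for the example.

```python
import math


def lpf_probability(k, coeffs, codes):
    """Value of an LPF of the form exp(k + sum_i k_i * code_i)."""
    return math.exp(k + sum(ki * xi for ki, xi in zip(coeffs, codes)))


def objective_terms(k, coeffs, copies, d_s):
    """Terms contributed to the objective: -k*d_S - sum_i k_i * x_{C_i,S}."""
    return -k * d_s - sum(ki * xi for ki, xi in zip(coeffs, copies))


if __name__ == "__main__":
    k, coeffs, codes = -0.5, [0.2, -0.1], [1.0, 3.0]
    active = objective_terms(k, coeffs, codes, d_s=1)
    inactive = objective_terms(k, coeffs, [0.0, 0.0], d_s=0)
    print(math.exp(-active), lpf_probability(k, coeffs, codes))
    print(inactive)
```

When the LPF is active, the exponential of the negated terms recovers the LPF value; when it is inactive, with d_S = 0 and the copies at 0, the terms contribute nothing.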

We now must show that the solution space defined by our induced constraint
system is equivalent to the space of all complete instantiation-sets for the Bayesian
network. We begin by providing a transformation from permissible assignments in
our constraint systems to instantiation-sets. Let s be a permissible solution. We can
construct a complete instantiation-set w[s] as follows: For each r.v. A in V, A is
instantiated if and only if $s(x_A) > 0$. Since our encoders are one-to-one and onto,
the inverse (called the decoder) exists, and we denote it by $E_A^{-1}$. If $s(x_A) > 0$,
then $w[s](A) = E_A^{-1}(s(x_A))$.

Conversely, given a complete instantiation-set w, we can construct a permissible
assignment s[w] as follows: For each r.v. A in V, $s[w](x_A) = E_A(w(A))$. Furthermore,
we properly activate the appropriate interval detectors. Finally, according to
the instantiation-set w, we can easily determine which LPFs are active: $s[w](d_S) = 1$
if and only if S is an active LPF according to w. If S is active, then copy
$s[w](x_{C_i,S}) = s[w](x_{C_i})$ for all i involved with S. Otherwise, $s[w](x_{C_i,S}) = 0$.

Theorem 7.3. w is a complete instantiation-set for B if and only if s[w] is a permissible
solution for the induced constraint system.

Having shown the equivalence, we can prove the following theorem on the
probabilities being calculated.

Theorem  7.4. 

$$P_S(w) = e^{-\Theta(s[w])}.$$

Therefore,  the  optimal  permissible  solution  for  our  induced  constraint  system 
will  be  the  best  complete  instantiation  set. 

Finally,  we  must  also  incorporate  the  notion  of  evidence.  Evidence,  we  recall, 
is  the  requirement  that  a  r.v.  be  instantiated  with  a  certain  value.  For  the  case  where 
a r.v. A must be instantiated to a, we simply include the constraint $x_A = E_A(a)$.

One final note for this section concerns the use of multiple encoders for a single
r.v. Associating a separate encoder-set with each conditional table increases the
flexibility of our formulation to compress the tables. Furthermore, multiple encoders
increase our complexity only linearly; we just have to guarantee consistency between
encoders. If one encoder says that A is instantiated to a, any other encoder must also
instantiate A to a. For example, say we have two encoders $E^1_A$ and $E^2_A$, and let
$a \in R(A)$. Let $x^1_A$ and $x^2_A$ be the encoder variables associated with the two
encoders, respectively. Basically, we must make sure that $x^1_A = E^1_A(a)$ iff
$x^2_A = E^2_A(a)$. We can accomplish this with the following constraints:

$$x^1_A - E^1_A(a) \geq -M(1 - y_{A=a}),$$

$$x^1_A - E^1_A(a) \leq M(1 - y_{A=a}),$$

$$x^2_A - E^2_A(a) \geq -M(1 - y_{A=a}),$$

$$x^2_A - E^2_A(a) \leq M(1 - y_{A=a}),$$

where $y_{A=a}$ is a 0-1 detector variable. Obviously, this can be generalized to any
number of encoders.


8.  Conclusions 

Reducing the storage and processing costs of probabilistic information is necessary
for reasoning systems, especially when we move to real-world problems. This
can be achieved by identifying patterns in the probabilistic tables. Splining functions
provide a flexible approach to exploiting these patterns. In particular, LPFs can reduce
the storage consumption from a typically multiplicative growth, $O(s_1 s_2 \cdots s_n)$, to an
additive one, $O(s_1 + s_2 + \cdots + s_n)$.

In  addition  to  the  fact  that  single  LPFs  can  be  computed  relatively  easily  via 
closed-form  solutions,  we  show  that  the  LPFs  now  provide  us  with  an  ordering  on  the 
probabilistic  information  not  available  before.  This  ordering  can  be  used  to  help  in 
searching  the  tables. 

A natural extension to single LPFs is multiple LPFs with multiple encoders.
Multiple LPFs provide better approximations while still minimizing storage and
processing costs. The difficulty in using multiple LPFs lies in determining which
LPF we are supposed to be computing with at any given moment. In particular, given
a collection of LPFs over a probability table, retrieving a particular entry in the table
via LPFs requires us to know which LPF we must look at. By carefully partitioning the
table into contiguous hypercubes, we can solve this identification problem by using
quick and easy bounds tests.
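The bounds test mentioned above can be sketched as follows. The data layout (one tuple of bounds, constant, and coefficients per hypercube cell) and all numeric values are hypothetical, chosen only to illustrate how a lookup might select the LPF that covers a given entry.

```python
import math

# Hypothetical layout: each cell is (bounds, k, coeffs), where bounds holds one
# (low, high) pair of encoder values per variable of the table.
CELLS = [
    ([(0.0, 1.0), (0.0, 2.0)], -0.5, [-0.2, -0.1]),
    ([(2.0, 3.0), (0.0, 2.0)], -1.0, [-0.1, -0.3]),
]


def find_cell(cells, codes):
    """Return the index of the first cell whose bounds contain every code."""
    for index, (bounds, _k, _coeffs) in enumerate(cells):
        if all(low <= code <= high for (low, high), code in zip(bounds, codes)):
            return index
    return None


def lookup_probability(cells, codes):
    """Evaluate the LPF of the cell that covers the given encoder values."""
    index = find_cell(cells, codes)
    if index is None:
        raise KeyError(f"no LPF covers {codes}")
    _bounds, k, coeffs = cells[index]
    return math.exp(k + sum(c * x for c, x in zip(coeffs, codes)))


if __name__ == "__main__":
    print(find_cell(CELLS, [0.5, 1.0]), find_cell(CELLS, [2.5, 1.0]))
    print(lookup_probability(CELLS, [2.0, 1.0]))
```

Each test costs two comparisons per variable, so identifying the LPF in this sketch is linear in the number of variables times the number of cells.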

When the knowledge engineer/domain builder can readily identify patterns
within a table, we can immediately apply multiple LPFs. When such information
is not readily available, we can identify patterns automatically using techniques
such as bisecting splines and merging splines.

Experimental  results  for  multiple  LPFs  seem  very  promising.  Going  to  a  2-LPF 
approximation  alone  resulted  in  an  average  500-600%  improvement  in  overall  fit  to 
the  conditional  probability  tables. 

Both single and multiple LPFs can be used directly in integer linear programming
models for reasoning. Integer linear programming has been shown to be an efficient
method for reasoning and seems to scale to larger problems quite well [5, 9-12].

Additionally, since LPFs are used in an effort to identify patterns, they bear
on the problem of missing or incomplete information. Without complete information,
reasoning models such as Bayesian networks cannot be used. By using the patterns
we identified through our approximations, we can extrapolate the missing information.

Finally, the LPF approach can also be applied to domains other than probabilistic
ones that require large amounts of data. Our approach, as we have presented it here,
is quite general and independent of the probabilistic framework we targeted.

Some of the questions we continue to pursue include proving whether or not
the problem of finding the best set of multiple LPFs is NP-hard. Another thread
involves merging our approach with a more advanced independence-based assignments
methodology called δ-IB assignments [16]. Instead of requiring that the entries be
identical, they can be similar within some δ. This approach can help provide bounds on
the expected errors from our approximations and on how they impact the overall
reasoning systems.


Appendix.  Proofs 


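Theorem 3.1. The minimal solution to (10) will be, for each r.v. $C_k$ in $T$ and for each
$c_k \in R(C_k)$,

$$E_{C_k}(c_k) = \frac{1}{\prod_{i=0,\, i \neq k}^{n} m_i} \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_k = c_k}} g(d_0,\ldots,d_n) \;-\; \frac{n\,\ell(T)}{(n+1)\prod_{i=0}^{n} m_i}, \tag{33}$$

where $m_i = |R(C_i)|$ for $i = 0, \ldots, n$ and

$$\ell(T) = \sum_{(d_0,\ldots,d_n) \in T} g(d_0,\ldots,d_n).$$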

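Proof. We begin by making the following observations. First, the encoders are already
variables themselves, so we can simply absorb the constants $k_j$ in front
of them in (10). Furthermore, $k$ is a translation factor which can also be absorbed by
the encoders. This leaves us with

$$f = \sum_{(c_0,\ldots,c_n) \in T} \left[ \sum_{j=0}^{n} E_{C_j}(c_j) - g(c_0,\ldots,c_n) \right]^2.$$

Taking the partial first derivatives, we get

$$\frac{\partial f}{\partial E_{C_k}(c_k)} = 2 \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_k = c_k}} \left[ \sum_{j=0}^{n} E_{C_j}(d_j) - g(d_0,\ldots,d_n) \right].$$

Now, substitute the encoder assignments above into the partial derivatives:

$$
\begin{aligned}
& \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_k = c_k}} \left[ \sum_{j=0}^{n} E_{C_j}(d_j) - g(d_0,\ldots,d_n) \right] \\
&\quad = E_{C_k}(c_k) \prod_{\substack{i=0 \\ i \neq k}}^{n} m_i
 + \sum_{\substack{j=0 \\ j \neq k}}^{n} \sum_{d_j \in R(C_j)} E_{C_j}(d_j) \prod_{\substack{i=0 \\ i \neq k,\ i \neq j}}^{n} m_i
 - \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_k = c_k}} g(d_0,\ldots,d_n) \\
&\quad = \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_k = c_k}} g(d_0,\ldots,d_n) - \frac{n\,\ell(T)}{(n+1)\,m_k}
 + \sum_{\substack{j=0 \\ j \neq k}}^{n} \sum_{d_j \in R(C_j)} \frac{1}{m_k} \left[ \sum_{\substack{(e_0,\ldots,e_n) \in T \\ e_j = d_j}} g(e_0,\ldots,e_n) - \frac{n\,\ell(T)}{(n+1)\,m_j} \right]
 - \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_k = c_k}} g(d_0,\ldots,d_n) \\
&\quad = -\frac{n\,\ell(T)}{(n+1)\,m_k} + \frac{1}{m_k} \sum_{\substack{j=0 \\ j \neq k}}^{n} \left[ \ell(T) - \frac{n\,\ell(T)}{n+1} \right] \\
&\quad = -\frac{n\,\ell(T)}{(n+1)\,m_k} - \frac{n^2\,\ell(T)}{m_k\,(n+1)} + \frac{n\,\ell(T)}{m_k} \\
&\quad = 0.
\end{aligned}
$$

□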
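The closed form (33) can be checked numerically on a small table. The sketch below is illustrative only: it assumes T is the full cross product of the ranges, represents the table g as a dictionary keyed by index tuples, and uses made-up values; the function names are not from the paper.

```python
from math import prod


def closed_form_encoders(sizes, g):
    """Encoder values from (33); sizes[i] = m_i, g maps index tuples to values."""
    n = len(sizes) - 1
    total = sum(g.values())                        # l(T)
    cells = prod(sizes)                            # product of all m_i
    shift = n * total / ((n + 1) * cells)
    encoders = []
    for k, m_k in enumerate(sizes):
        others = cells // m_k                      # product of m_i for i != k
        encoders.append([
            sum(v for idx, v in g.items() if idx[k] == c) / others - shift
            for c in range(m_k)
        ])
    return encoders


def gradient(sizes, g, encoders):
    """Partial derivatives of the squared error with respect to each encoder value."""
    dims = range(len(sizes))
    return [[2 * sum(sum(encoders[j][idx[j]] for j in dims) - v
                     for idx, v in g.items() if idx[k] == c)
             for c in range(m_k)]
            for k, m_k in enumerate(sizes)]


if __name__ == "__main__":
    sizes = [2, 3]
    g = {(0, 0): -0.2, (0, 1): -1.5, (0, 2): -0.7,
         (1, 0): -2.0, (1, 1): -0.3, (1, 2): -1.1}
    enc = closed_form_encoders(sizes, g)
    print(max(abs(v) for row in gradient(sizes, g, enc) for v in row))
```

On this example the gradient vanishes up to rounding, and when the table is exactly additive (g(i, j) = a_i + b_j) the encoders reproduce every entry.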

Theorem 4.2. Given any hypercube partition $p(T) = \{\sigma_1, \ldots, \sigma_m\}$ and the following
encodings

$$E^t_{C_k}(c_k) = \frac{1}{\prod_{i=0,\, i \neq k}^{n} m^t_i} \sum_{\substack{d_0 \in R^t(C_0),\, \ldots,\, d_n \in R^t(C_n) \\ d_k = c_k}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \;-\; \frac{n\,\ell^t(T)}{(n+1)\prod_{i=0}^{n} m^t_i}, \tag{34}$$

where $R^t(C_i) = \{c_i \in R(C_i) \mid \chi^t_{C_i}(c_i) = 1\}$, $m^t_i = |R^t(C_i)|$, and

$$\ell^t(T) = \sum_{(c_0,\ldots,c_n) \in \sigma_t} \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots),$$

this is an extreme point for (14).


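Proof. We begin by rewriting (14) in terms of projection functions as well as with
Lagrange multipliers:

$$
\begin{aligned}
f = {} & \sum_{t=1}^{m} \sum_{(c_0,\ldots,c_n) \in T} \left[ \prod_{i=0}^{n} \chi^t_{C_i}(c_i) \right]
 \left[ \sum_{j=0}^{n} k^t_j E^t_{C_j}(c_j) + k^t - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \right]^2 \\
& + \sum_{(c_0,\ldots,c_n) \in T} \lambda_{(c_0,\ldots,c_n)} \left[ \sum_{t=1}^{m} \prod_{i=0}^{n} \chi^t_{C_i}(c_i) - 1 \right] \\
& + \sum_{t=1}^{m} \sum_{i=0}^{n} \sum_{c_i \in R(C_i)} \lambda^t_{C_i}(c_i) \left[ \chi^t_{C_i}(c_i)^2 - \chi^t_{C_i}(c_i) \right].
\end{aligned}
\tag{35}
$$

We make the following observations. First, the encoders are already variables themselves,
so we can simply absorb the constants $k^t_j$ in front of them in (35).
Furthermore, $k^t$ is a translation factor which can also be absorbed by the encoders.
This leaves us with

$$
\begin{aligned}
f = {} & \sum_{t=1}^{m} \sum_{(c_0,\ldots,c_n) \in T} \left[ \prod_{i=0}^{n} \chi^t_{C_i}(c_i) \right]
 \left[ \sum_{j=0}^{n} E^t_{C_j}(c_j) - \ln P(C_0 = c_0 \mid \ldots, C_i = c_i, \ldots) \right]^2 \\
& + \sum_{(c_0,\ldots,c_n) \in T} \lambda_{(c_0,\ldots,c_n)} \left[ \sum_{t=1}^{m} \prod_{i=0}^{n} \chi^t_{C_i}(c_i) - 1 \right] \\
& + \sum_{t=1}^{m} \sum_{i=0}^{n} \sum_{c_i \in R(C_i)} \lambda^t_{C_i}(c_i) \left[ \chi^t_{C_i}(c_i)^2 - \chi^t_{C_i}(c_i) \right].
\end{aligned}
\tag{36}
$$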

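Now, taking the partial derivatives, we get

$$
\begin{aligned}
\frac{\partial f}{\partial \chi^t_{C_p}(c_p)} = {} & \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_p = c_p}} \left[ \prod_{\substack{i=0 \\ i \neq p}}^{n} \chi^t_{C_i}(d_i) \right]
 \left[ \sum_{j=0}^{n} E^t_{C_j}(d_j) - \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \right]^2 \\
& + \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_p = c_p}} \lambda_{(d_0,\ldots,d_n)} \prod_{\substack{i=0 \\ i \neq p}}^{n} \chi^t_{C_i}(d_i)
 + 2\lambda^t_{C_p}(c_p)\,\chi^t_{C_p}(c_p) - \lambda^t_{C_p}(c_p),
\end{aligned}
\tag{37}
$$

$$\frac{\partial f}{\partial \lambda_{(c_0,\ldots,c_n)}} = \sum_{t=1}^{m} \prod_{i=0}^{n} \chi^t_{C_i}(c_i) - 1, \tag{38}$$

$$\frac{\partial f}{\partial \lambda^t_{C_i}(c_i)} = \chi^t_{C_i}(c_i)^2 - \chi^t_{C_i}(c_i), \tag{39}$$

$$\frac{\partial f}{\partial E^t_{C_p}(c_p)} = \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_p = c_p}} 2 \left[ \prod_{i=0}^{n} \chi^t_{C_i}(d_i) \right]
 \left[ \sum_{j=0}^{n} E^t_{C_j}(d_j) - \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \right]. \tag{40}$$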

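An extreme point is any solution at which all of the partial derivatives are 0.

Since p(T) is a hypercube partition of T, that is, we can associate characteristic
functions to each cell in p(T), (38) and (39) are automatically satisfied.

Next, rewrite (37) as follows:

$$
\begin{aligned}
\lambda^t_{C_p}(c_p) \left[ 1 - 2\chi^t_{C_p}(c_p) \right] = {} & \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_p = c_p}} \left[ \prod_{\substack{i=0 \\ i \neq p}}^{n} \chi^t_{C_i}(d_i) \right]
 \left[ \sum_{j=0}^{n} E^t_{C_j}(d_j) - \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \right]^2 \\
& + \sum_{\substack{(d_0,\ldots,d_n) \in T \\ d_p = c_p}} \lambda_{(d_0,\ldots,d_n)} \prod_{\substack{i=0 \\ i \neq p}}^{n} \chi^t_{C_i}(d_i).
\end{aligned}
\tag{41}
$$

We can arbitrarily assign values for $\lambda_{(d_0,\ldots,d_n)}$ and satisfy (37) by (41).

All we have left to do is to show that the choice of encoder mappings above
will satisfy (40).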

For a particular cell $\sigma_v$ in p(T), (40) reduces to

$$\sum_{\substack{(d_0,\ldots,d_n) \in \sigma_v \\ d_p = c_p}} \left[ \sum_{j=0}^{n} E^v_{C_j}(d_j) - \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \right]. \tag{42}$$

Observe that there are no interdependencies via (42) between encoders from different
encoder-sets. Hence, we can group together all the equations from (42) by
encoder-sets.

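Now, let us plug the encoder mappings above into (42):

$$
\begin{aligned}
& \sum_{\substack{(d_0,\ldots,d_n) \in \sigma_v \\ d_p = c_p}} \left[ \sum_{j=0}^{n} E^v_{C_j}(d_j) - \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \right] \\
&\quad = E^v_{C_p}(c_p) \prod_{\substack{i=0 \\ i \neq p}}^{n} m^v_i
 + \sum_{\substack{j=0 \\ j \neq p}}^{n} \sum_{d_j \in R^v(C_j)} E^v_{C_j}(d_j) \prod_{\substack{l=0 \\ l \neq p,\ l \neq j}}^{n} m^v_l
 - \sum_{\substack{(d_0,\ldots,d_n) \in \sigma_v \\ d_p = c_p}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \\
&\quad = \sum_{\substack{(d_0,\ldots,d_n) \in \sigma_v \\ d_p = c_p}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) - \frac{n\,\ell^v(T)}{(n+1)\,m^v_p} \\
&\qquad + \sum_{\substack{j=0 \\ j \neq p}}^{n} \sum_{d_j \in R^v(C_j)} \frac{1}{m^v_p} \left[ \sum_{\substack{(e_0,\ldots,e_n) \in \sigma_v \\ e_j = d_j}} \ln P(C_0 = e_0 \mid \ldots, C_i = e_i, \ldots) - \frac{n\,\ell^v(T)}{(n+1)\,m^v_j} \right] \\
&\qquad - \sum_{\substack{(d_0,\ldots,d_n) \in \sigma_v \\ d_p = c_p}} \ln P(C_0 = d_0 \mid \ldots, C_i = d_i, \ldots) \\
&\quad = -\frac{n\,\ell^v(T)}{(n+1)\,m^v_p} + \frac{1}{m^v_p} \sum_{\substack{j=0 \\ j \neq p}}^{n} \left[ \ell^v(T) - \frac{n\,\ell^v(T)}{n+1} \right] \\
&\quad = -\frac{n\,\ell^v(T)}{(n+1)\,m^v_p} + \frac{n\,\ell^v(T)}{m^v_p} - \frac{n^2\,\ell^v(T)}{(n+1)\,m^v_p} \\
&\quad = -\frac{n\,\ell^v(T)}{(n+1)\,m^v_p} + \frac{n(n+1)\,\ell^v(T)}{(n+1)\,m^v_p} - \frac{n^2\,\ell^v(T)}{(n+1)\,m^v_p} \\
&\quad = -\frac{n\,\ell^v(T)}{(n+1)\,m^v_p} + \frac{n^2\,\ell^v(T)}{(n+1)\,m^v_p} + \frac{n\,\ell^v(T)}{(n+1)\,m^v_p} - \frac{n^2\,\ell^v(T)}{(n+1)\,m^v_p} \\
&\quad = 0.
\end{aligned}
$$

Thus, we have an extreme point.

□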
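The per-cell encodings of (34) can be illustrated by fitting each hypercube cell separately. The sketch below is a toy example under stated assumptions: the log-values are invented (they are not normalized probabilities), the cells are given explicitly as lists of index ranges, and the function names are hypothetical.

```python
from itertools import product
from math import prod


def fit_cell(ranges, log_p):
    """Closed-form encoders (34) for one hypercube cell.

    ranges[i] lists the values of C_i inside the cell; log_p maps tuples to ln P.
    """
    n = len(ranges) - 1
    cell = list(product(*ranges))
    total = sum(log_p[x] for x in cell)                 # l^t(T)
    size = prod(len(r) for r in ranges)
    shift = n * total / ((n + 1) * size)
    encoders = []
    for k, r_k in enumerate(ranges):
        others = size // len(r_k)                       # product of m_i^t, i != k
        encoders.append({c: sum(log_p[x] for x in cell if x[k] == c) / others - shift
                         for c in r_k})
    return encoders


def squared_error(ranges, log_p, encoders):
    """Sum of squared residuals of the fitted LPF over the cell."""
    dims = range(len(ranges))
    return sum((sum(encoders[j][x[j]] for j in dims) - log_p[x]) ** 2
               for x in product(*ranges))


def total_error(cells, log_p):
    """Fit every cell independently and add up the squared errors."""
    return sum(squared_error(r, log_p, fit_cell(r, log_p)) for r in cells)


# Invented log-values: rows 0-1 and rows 2-3 follow different additive patterns.
A = [-0.3, -1.2, -0.6, -2.0]
B_TOP, B_BOTTOM = [-0.1, -0.9], [-0.8, -0.2]
LOG_P = {(i, j): A[i] + (B_TOP[j] if i < 2 else B_BOTTOM[j])
         for i in range(4) for j in range(2)}

if __name__ == "__main__":
    print(total_error([[[0, 1, 2, 3], [0, 1]]], LOG_P))                  # one LPF
    print(total_error([[[0, 1], [0, 1]], [[2, 3], [0, 1]]], LOG_P))      # two LPFs
```

In this toy table a single LPF leaves a clearly nonzero error, while the two-cell partition fits every entry up to rounding, since each cell is additive on its own.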
5. ILP Formulation for multiple LPFs

In this section, we consider an integer linear programming approach for determining
the best multiple LPFs. In particular, we consider the case where we are
searching for the best two-LPF partitioning.

Our approach will be to map the problem into a mixed integer linear programming
problem. First, we must define the variables we will use in our formulation. For
each $C_i$ involved in the current table, we associate a 0-1 valued variable $Q_{C_i}$
representing whether $R(C_i)$ is to be split into two disjoint sets. We make the observation
that for a two-partition contiguous solution, exactly one of the $R(C_i)$'s will be divided
into two subsets.

Since we will have two LPFs, say $S_1$ and $S_2$, let $S_{C_i}(c_i)$ be an integer variable
representing which LPF $C_i = c_i$ is associated with: 0 implies $S_1$, 1 implies $S_2$, and
2 implies that it is associated with both. Let $E^1_{C_i}(c_i)$ and $E^2_{C_i}(c_i)$ represent the encoder
variables for $S_1$ and $S_2$, respectively.

Let $d(P(x), S_j(x))$, where $x = (c_0, \ldots, c_n) \in T$, be a real variable representing
the distance or error in LPF $S_j$ on entry x. Since $S_j$ is not applied to the entries of its
counterpart LPF, this variable is 0 for those entries. Otherwise, its value is
$|-\ln P(x) + \sum_{c_i \in x} E^j_{C_i}(c_i)|$. Let $D(P(x), S_j(x))$ be a 0-1 variable indicating
that we must consider the above distance measure, that is, that x is to be approximated by $S_j$.

Now, with the variables defined, we consider the constraints. If $Q_{C_i} = 0$,
then $R(C_i)$ is not split, which further implies that $S_{C_i}(c_i) = 2$ for all $c_i \in R(C_i)$:

$$S_{C_i}(c_i) \geq 2 - 2Q_{C_i}. \tag{43}$$

Otherwise, it must be either 0 or 1:

$$S_{C_i}(c_i) \leq 2 - Q_{C_i}. \tag{44}$$

For the distance variables, if $Q_{C_i} = 1$, then $S_{C_i}(c_i)$ will determine which LPF
is to be used:

$$D(P(x), S_1(x)) \leq 2(1 - Q_{C_i}) + (1 - S_{C_i}(c_i)), \tag{45}$$

$$D(P(x), S_2(x)) \leq S_{C_i}(c_i), \tag{46}$$

$$D(P(x), S_1(x)) + D(P(x), S_2(x)) = 1. \tag{47}$$
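To see which combinations constraints (43) to (47) admit, one can enumerate them for a single variable $C_i$ and a single entry x. This is only an illustrative truth-table sketch; the function name is hypothetical and the variables are restricted to their integer domains by assumption.

```python
from itertools import product


def feasible_assignments():
    """Enumerate (Q, S, D1, D2) satisfying (43)-(47) for one variable and one entry."""
    rows = []
    for q, s, d1, d2 in product((0, 1), (0, 1, 2), (0, 1), (0, 1)):
        ok = (s >= 2 - 2 * q                       # (43)
              and s <= 2 - q                       # (44)
              and d1 <= 2 * (1 - q) + (1 - s)      # (45)
              and d2 <= s                          # (46)
              and d1 + d2 == 1)                    # (47)
        if ok:
            rows.append((q, s, d1, d2))
    return rows


if __name__ == "__main__":
    for row in feasible_assignments():
        print(row)
```

When Q = 1, the value of S fixes which LPF is charged for the entry (S = 0 selects $S_1$, S = 1 selects $S_2$). When Q = 0, S is forced to 2 and this variable leaves the choice open, so it is decided by the variable that is actually split.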

If $D(P(x), S_j(x)) = 0$, then $d(P(x), S_j(x)) = 0$:

$$d(P(x), S_j(x)) \leq M\,D(P(x), S_j(x)), \tag{48}$$

where M is some arbitrarily large constant. Otherwise,
$d(P(x), S_j(x)) = |-\ln P(x) + \sum_{c_i \in x} E^j_{C_i}(c_i)|$:

$$d(P(x), S_j(x)) \geq -\ln P(x) + \sum_{c_i \in x} E^j_{C_i}(c_i) - M\,(1 - D(P(x), S_j(x))), \tag{49}$$


E.  Santos,  Jr.  /  Multiple  spline  approximations 




$$d(P(x), S_j(x)) \geq -\sum_{c_i \in x} E^j_{C_i}(c_i) + \ln P(x) - M\,(1 - D(P(x), S_j(x))). \tag{50}$$
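The way (48), (49) and (50) pin the distance variable to an absolute value can be checked with a short sketch. It assumes d is nonnegative and uses an arbitrary value for M; `residual` stands for $-\ln P(x) + \sum_{c_i \in x} E^j_{C_i}(c_i)$, and the function name is hypothetical.

```python
def distance_range(residual, big_d, big_m=1000.0):
    """Feasible interval for d under (48)-(50), assuming d >= 0."""
    slack = big_m * (1 - big_d)
    lower = max(0.0, residual - slack, -residual - slack)   # (49) and (50)
    upper = big_m * big_d                                    # (48)
    return lower, upper


if __name__ == "__main__":
    print(distance_range(-0.3, 1))   # entry assigned to this LPF
    print(distance_range(-0.3, 0))   # entry assigned to the other LPF
```

With D = 1 the lower end of the interval is |residual|, so minimizing the objective drives d to the absolute error; with D = 0 the interval collapses to 0.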

Since only one of the $R(C_i)$ will be split,

$$\sum_{i=0}^{n} Q_{C_i} = 1. \tag{51}$$

Finally, the objective function we wish to minimize is

$$\sum_{x \in T} d(P(x), S_j(x)). \tag{52}$$

If there exists a 2-LPF perfect fit for T, solving the above mixed integer programming
problem will find it. We do note, though, that when such a perfect fit does not
exist, our optimal solution is based on a slightly different metric from the original
problem (the sum of absolute errors in (52) rather than the squared error). However,
any solution generated by this approach should be sufficient for most purposes.

Finally,  we  can  generalize  this  formulation  to  more  than  2  LPFs  by  encoding 
the  characteristic  functions  into  the  problem. 